
Communications Measurement, P-Hacking, and the Replicability Crisis

[Image: Doc Brown from Back to the Future]

Our current social climate has science playing defense. Two blots in particular on its pristine lab coat have received ample unfavorable press: the replicability crisis and p-hacking. This article reviews the sources of these problems and their relevance to communications measurement. The short answer: if you are going to gather and analyze data, you need to understand the limits of statistical inference and the sources of spurious relationships.

[Image: xkcd comic “Significant”]

P-hacking. It’s a bit of a shameful secret. But every social science researcher—and that includes us measurement types—has given in to temptation. Just think back to that pile of data you were examining, and for which your analysis yielded an almost significant result. So then you sliced it another way, maybe threw out some outliers. Or maybe you restated the question you wanted to answer to fit what the data was telling you. You dredged around in the data until you found something to crow about. You dirty p-hacker you.

P-hacking, also known as data snooping or data dredging, is a temptation born of the nature of statistical inference. When you use statistical tests to look for relationships between variables (e.g., did your tweets result in more traffic to your website?), you are often going to find that some things appear to be related to others. The probability that you’d find such a result by chance is the famous p of statistics. P<.05—less than a five percent chance that the result is a fluke—is the measure of statistical success in the social sciences. And it’s often the measure of whether or not John or Jill Researcher gets published, promoted, or tenured. P=.05 rules. But it can be deceptive more often than you’d think. There are plenty of ways to hack it.
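
For the statistically inclined, here is what a single test of that tweets-and-traffic question might look like in code. It’s a minimal sketch with made-up numbers (the visit counts and variable names are purely hypothetical), using Python’s SciPy library:

```python
# A single significance test on made-up numbers: did days with tweets
# bring more traffic to the website? (All figures are hypothetical.)
from scipy import stats

visits_tweet_days = [532, 610, 498, 575, 640, 588, 502, 617]
visits_quiet_days = [480, 515, 470, 530, 495, 510, 488, 540]

# Two-sample t-test: how surprising is this gap if tweets made no difference?
t_stat, p_value = stats.ttest_ind(visits_tweet_days, visits_quiet_days)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# p < .05 means a gap this large would turn up less than 5% of the time
# by chance alone -- it does not prove the tweets caused the traffic.
```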

Correlations gone wild

Here’s the thing about a pile of data. If you look at it in enough different ways, try enough tests on this and that, sooner or later you and your statistical shotgun are going to identify some kind of significant effect. But sometimes that effect appears just by chance. Such spurious results are a dime a dozen—or actually one out of twenty, if you’re using p=.05 as your statistical confidence level. For instance, if you correlate enough trends, you will find high correlations by chance. See “Spuriouser and Spuriouser: It’s Raining Correlations” for some examples.
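
If you’d like to see this for yourself, here is a minimal sketch (a hypothetical setup, plain Python with NumPy): generate a handful of completely unrelated random trends, correlate every pair, and count how many pairs look impressively related.

```python
# Twenty unrelated random trends, correlated every which way.
import numpy as np

rng = np.random.default_rng(0)
n_trends, n_months = 20, 24                    # e.g., 20 metrics over 24 months
trends = rng.normal(size=(n_trends, n_months)).cumsum(axis=1)  # pure random walks

corr = np.corrcoef(trends)                     # every trend against every other
pairs = [(i, j) for i in range(n_trends) for j in range(i + 1, n_trends)]
strong = [p for p in pairs if abs(corr[p]) > 0.7]

print(f"{len(strong)} of {len(pairs)} pairs correlate above 0.7 -- all by chance")
```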

It’s only natural to want to find a story in the data. Suppose you’re trying to figure out what went wrong with your PR program, and you have collected a pile of data. Any researcher with an ounce of curiosity is going to look at the data this way and that, to try and figure out what is going on. Seen in this light, a little p-hacking is to be expected.

But if you are using the scientific method to build on a body of existing knowledge, then you are probably doing hypothesis testing. Which means you generate a single hypothesis first, then use your data to test it. If you develop several alternative hypotheses after you take a look at the data, well, that sort of fishing is p-hacking. And the reason why it’s frowned upon is, once again, that if you test enough alternative hypotheses, then sooner or later you are going to find a significant result just by chance.

Here’s a fun do-it-yourself p-hacking exercise: Hack Your Way to Scientific Glory. The data and instant analysis are provided; all you have to do is choose which variables to include.

Chance is statistics’ superpower—but also its kryptonite

Statistics is a marvelous tool. Once you make an educated assumption about the nature of the distribution(s) of your variables, you can use statistics on a (data) sample to reveal how likely it is that you would see such data if the sample really came from a given population, or if the variables were truly unrelated. That’s the superpower: you know how likely it is that you would get such a result by chance. But when you start doing tests over and over, that’s when chance becomes your downfall: sooner or later you are going to get a spurious result.
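
Some back-of-the-envelope arithmetic (our illustration, not a formal analysis) shows how quickly chance catches up. Each test at p = .05 has a 5 percent chance of flagging a relationship that isn’t there; run enough independent tests on noise and a false alarm becomes nearly certain:

```python
# Odds of at least one spurious "significant" finding across repeated tests at p = .05.
for n_tests in (1, 5, 10, 20):
    chance_of_false_alarm = 1 - 0.95 ** n_tests
    print(f"{n_tests:2d} tests: {chance_of_false_alarm:.0%} chance of a fluke")

# Prints roughly 5%, 23%, 40%, and 64% -- keep slicing the data and a
# "discovery" is all but guaranteed.
```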

Or, as Christie Aschwanden puts it in her excellent introduction to p-hacking at FiveThirtyEight: “It’s a lot easier to get a result than answer a question… what scientists really want to know is whether their hypothesis is true, and if so, how strong the finding is… you can think of the p-value as an index of surprise. How surprising would these results be if you assumed your hypothesis was false?”

P-hacking measurement?

What potential pitfalls does p-hacking hold for communications measurement? Experienced researchers, and anyone who has studied experimental methods and statistics, know how to avoid it. But it’s a possible problem for the novice, who, armed with Excel and a pile of data, can fish for correlations until they find a pond full of them. Tom Webster’s cautionary post on social media data dredging focuses specifically on email open rates and days of the week.

The march of technology is beginning to provide automated analysis tools which will search through a mountain of data for related variables. (Some claim that in the future we’ll be forced to rely on such automated analysis, as there is just too much data for mere humans to crunch through without help.) Especially in measurement, where we encourage every communications practitioner to collect and analyze data to improve their programs, it may become common for stats newbies to dredge around in their data without understanding the potential for identifying erroneous effects. Does Excel need a warning label?

The replicability crisis, or, science and the one hit wonder

P-hacking is one contributing factor to the larger problem of the replicability crisis. That’s the name given to the recent wave of attention to an apparent lack of reproducibility in social science research. In medicine, education, psychology, and a host of other fields, it has become apparent that just because one study shows a relationship between two variables, that doesn’t mean another study will find the same relationship. The prospect that science might somehow be broken feeds the current social climate of, well, climate deniers and others who would like to adjust reality to their own agendas.

The replicability crisis received major attention a year and a half ago with the publication of “Estimating the reproducibility of psychological science” in Science. (Read a summary here, and some more in-depth discussions here and here.) For a less academic and more humorous take, John Oliver provides a great video summary, with emphasis on the media’s thirst for breakthrough findings. The upshot is that a group of researchers tried to replicate recently published psychological studies, and were able to do so only 40 percent of the time.

Shocking! …or is it? As with p-hacking, not really, especially if you are at all versed in experimental methods and the way research gets published and publicized. If so, you’ll know that there are plenty of sources of bias that can skew results. And that one set of experimental results often needs a repeat performance to prove itself.

What causes it?

“Replicability crisis” is a bit of a misnomer. While the end result is that research results are replicable less often than hoped (or hyped), there are a variety of causes, mostly related to shoddy methodology, human bias, and institutional incentives. Their combined effect is to make social science appear less robust and more prone to error. In addition to p-hacking, they include:

  • The 95 percent confidence level—19 out of 20 are great odds for a gambler, or to bet your career on, but not many would bet their life on them.
  • Cherry-picking results—If the same experiment is tried enough times, sooner or later someone is going to get a statistically significant result. Never mind all the unreported insignificant results.
  • Publication bias—Everyone wants to hear what did happen. What didn’t happen is not so interesting, so the failures don’t get published, and don’t get noticed.
  • People don’t often bother to try to reproduce results. It’s old news. As John Oliver says, “There’s no reward for being the second person to discover something in science.”

So, what’s this have to do with communications measurement? It is inspirational to think of measurement as social science research—with each measurement program its own little experiment. But there’s a big difference between everyday evaluation of business comms and the kind of high-powered research presented at the annual International Public Relations Research Conference, or published in the Research Journal of the Institute for Public Relations. Many of the causes of the replicability crisis have more to do with proper social science research than with measurement.

In our Measurement Life interview Don Stacks called the replicability crisis “a crisis in a teapot” and went on to champion proper research methods:

“Let’s start by looking at how we do research. First, an approach leads to a typically deductive theory. That theory suggests the methodology for answering research questions and hypotheses. The methodology then sets the analytics, which in turn sets the evaluation of the initial RQs or Hs. Further, since it is theory-based, our research can be reviewed. Since the methodologies are scientific, we can examine them for methodological and analytical problems.”

Yet he also implies that the replicability crisis is fed by the layperson’s misunderstanding of scientific methods:

“There is a mistaken assumption that results hold over time, which is simply not true. Each population or sample differs from the next (or preceding). This means that there will always be error, much like a car’s speedometer, which can be off by 1-5% at any given time.”

We all know that in the business environment there are many pressures on the use of data and its analysis. Business is business, and results that don’t support business goals can be easily set aside. (See the article “6 Measurement Pros Explore the Value of Failure: Everybody Wants to Learn, but Nobody Likes to Make a Mistake.”) When it comes to biased measurement results, it may well be that the problems that have led to the replicability crisis are minor compared to the pressures of the business environment.

###

Thanks to Fake Plastic Trees for the Doc Brown image, and to xkcd for the comic.

Bill Paarlberg co-founded The Measurement Standard in 2002 and was its editor until July 2017. He also edits The Measurement Advisor newsletter. He is editor of the award-winning "Measuring the Networked Nonprofit" by Beth Kanter and Katie Paine, and editor of two other books on measurement by Katie Paine, "Measure What Matters" and "Measuring Public Relationships." Visit Bill Paarlberg's page on LinkedIn.