The BITSS Take on "wormwars" and Replication Writ Large

Garret Christensen, BITSS Project Scientist

If you’re a development economist, or at all interested in research transparency, I assume you’ve heard about the recent deworming replication controversy. (If you were lucky enough to miss “wormwars,” you can catch up on just about everything with this one set of links on Storify.) Here at BITSS, we’re obviously interested in this story, since we strongly support replication in general, and because the original paper being replicated was written by BITSS director Ted Miguel, along with Harvard economist Michael Kremer. I won’t go into the details or tell you what to think, as you should just read the original paper, the pure replication, the statistical replication, the response by Miguel and Kremer, the reply to that by the replicators, the BuzzFeed article, thoughts from Chris Blattman and Berk Ozler, thoughts from the editors who published the replications, and thoughts from 3ie, who funded and facilitated the replications, all of which are linked on Storify. (Make sure you don’t just read the replications or sensationalist news articles without reading the reply by the original authors.)

I believe mass deworming is cost-effective at improving health and increasing school attendance, but I don’t think I’m a particularly objective judge of things since (A) I’m trained as an economist so I tend to agree with other economists, (B) I’ve spent the better part of the last ten years employed by Miguel and/or Michael Kremer, including on research following up on the subjects of the original deworming study, and (C) I’m human.

So what will I attempt to add?

1) BITSS strongly supports data and code sharing.

This isn’t Miguel’s first time at the replication rodeo. For years he’s been posting his data publicly on his website, and it’s now on Harvard’s Dataverse too. Researchers have used the data to re-evaluate a paper using rainfall shocks as an instrumental variable for GDP and its effect on civil conflict (original, comment, reply) and a paper on climate change and conflict (originals (1,2), comment, reply). There may be others I don’t know of.

When you share your data, it’s easy for people to replicate your work. It may be messy and complicated when you share your work and others notice (minor?) errors in your coding, but compare that to the unknown of all the data and code that isn’t shared. Who knows what’s going on? Journals should require data sharing. Several of the top economics journals, including Econometrica, where the original worms paper was published, now require data and code sharing. Not all is perfect in economics (the Quarterly Journal of Economics has no such policy) but I’d say we’re doing a relatively good job as a discipline. The more universal this requirement becomes, the better science becomes. BITSS was part of the team that produced the Transparency and Openness Promotion Guidelines, which we encourage all journals to seriously look into.

2) BITSS strongly supports replication.

I think it’s unfortunate that academics aren’t rewarded for doing replication, and they especially aren’t rewarded for successful replications. When have you ever even seen a successful replication get published? Most of the published replications of which I am aware started as an assignment in a graduate course. For example, that’s how Herndon et al.’s replication of Reinhart and Rogoff’s paper on debt and growth started. But what about the twenty or thirty other papers from that class? Did they replicate without error? Are we to assume from Herndon’s experience that all economics is wrong, or that 3.3% of economics is wrong? I think the answer is much closer to 3% than 100%, but see caveat (A) above.

I think the solution is to publish replications, both positive and negative. I know it’s not human nature to think things you’ve heard before are interesting, but if editors were to start giving some non-zero amount of journal space to papers that confirm previously published results, that could change incentives dramatically. Currently replicators only benefit from finding an error, so they have an incentive to manufacture one. In both randomized trials and in observational work, researchers have flexibility in statistical decision making. If they make reasonable choices A through Z, and get significant results, then replicators have an incentive to search across every combination of A through Z for the ones that produce insignificant results, and report only the combinations that contradict the original claim. If a positive replication had a snowball’s chance of getting published, replicators would be less incentivized to assume the worst about the original researcher.
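To make the specification-search worry concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data are invented (a real treatment effect of 0.25 standard deviations plus three binary traits that have nothing to do with the outcome), the function names are hypothetical, and the "choices" are just sample restrictions on those traits. It has nothing to do with the actual deworming data or with what any replicator did.

```python
import itertools
import math
import random
import statistics


def welch_t(treated, control):
    """Welch t-statistic for a difference in means (treated minus control)."""
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff / se


def simulate_units(n=2000, effect=0.25, seed=0):
    """Invented data: a real treatment effect plus three irrelevant binary traits."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        treated = rng.random() < 0.5
        traits = tuple(rng.randint(0, 1) for _ in range(3))
        outcome = (effect if treated else 0.0) + rng.gauss(0, 1)
        units.append((treated, traits, outcome))
    return units


def run_specifications(units):
    """Estimate the effect under every sample restriction on the three traits.

    Each trait is either ignored (None) or used to keep only units with
    value 0 or value 1, which gives 3 ** 3 = 27 specifications.
    """
    results = {}
    for spec in itertools.product((None, 0, 1), repeat=3):
        sample = [
            u for u in units
            if all(want is None or have == want for want, have in zip(spec, u[1]))
        ]
        treated = [y for t, _, y in sample if t]
        control = [y for t, _, y in sample if not t]
        results[spec] = welch_t(treated, control)
    return results


if __name__ == "__main__":
    results = run_specifications(simulate_units())
    # Conventional two-sided 5% threshold, used here only as a rough cutoff.
    insignificant = [s for s, t in results.items() if abs(t) < 1.96]
    print(f"full-sample t-statistic: {results[(None, None, None)]:.2f}")
    print(f"{len(insignificant)} of {len(results)} specifications are insignificant")
```

The effect is real by construction, yet the narrower subsamples have less power, so a replicator who reports only the insignificant restrictions could tell a very different story from one who reports all 27.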

3) We should be clear exactly what we mean by “replication.”

Different people use the same word to mean different things. Are you running the original code/analysis on the original data to check for errors? That’s one thing. Are you running different code (different regression specifications) on the same data? That’s something else. In wormwars parlance, these are called “pure replication” and “statistical replication,” respectively. The same code on different data? Different code on different data? Also different. You can write this off as just semantics, but I think you do so at our collective peril. I’m no biologist, but I date one, and I’m pretty sure she’s really careful about what plants are named.

Michael Clemens has a nice working paper on this where he suggests the following taxonomy:

  • Same analysis, same data: verification
  • Same analysis, different data: reproduction
  • Different analysis, same data: reanalysis
  • Different analysis, different data: extension
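
The taxonomy is really just a two-by-two lookup, which a few lines of code can make explicit. This is only a sketch; the function name and its arguments are my own invention, not anything from Clemens's paper.

```python
def classify_replication(same_analysis: bool, same_data: bool) -> str:
    """Return the term from the four-way taxonomy listed above."""
    taxonomy = {
        (True, True): "verification",
        (True, False): "reproduction",
        (False, True): "reanalysis",
        (False, False): "extension",
    }
    return taxonomy[(same_analysis, same_data)]


if __name__ == "__main__":
    # Original code on the original data (the "pure replication" in wormwars terms).
    print(classify_replication(same_analysis=True, same_data=True))
    # Different specifications on the same data (the "statistical replication").
    print(classify_replication(same_analysis=False, same_data=True))
```

Under this mapping, the wormwars “pure replication” would be a verification and the “statistical replication” would be a reanalysis.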

4) Writing a pre-analysis plan could save you replication headaches in the future.

The deworming study was one of the first randomized controlled trials (RCTs) in development economics, and Miguel and Kremer didn’t write a pre-analysis plan. If they had, they could have insulated themselves against some of the critiques in Davey et al., because some alleged flexibility in analysis would have been removed. (I’m thinking of the timing of intervention delivery and the debate over assignment to treatment or control group for the ITT estimate.) The social sciences are improving, and there are now several places you can pre-register your trial and include a pre-analysis plan (AEA, EGAP, RIDIE, OSF). Miguel now has several papers involving RCTs with pre-analysis plans, including Casey et al., which shows how the authors could have produced any answer they wanted had they not tied their hands with a pre-analysis plan. Writing a pre-analysis plan for observational work isn’t quite as straightforward, but this example by David Neumark regarding the minimum wage shows that it is in fact possible.

5) Science is hard. Interdisciplinary science is hard. Replication is hard. Let’s do more of it.

After getting my PhD in economics, I worked on an interdisciplinary (economics/epidemiology/public health) water, sanitation, and hygiene RCT in Kenya called WASH Benefits. I went in assuming that since I knew how economists run RCTs, I knew how everyone ran RCTs. It’s the gold standard, right? So the statistics should be identical–that’s what “standard” means. It’s hyperbole, but I now often say that I left thinking that different disciplines actually use mutually exclusive assumptions. This makes the cynic in me want to quit and go hiking, but that’s really not the appropriate reaction.

Different disciplines may see things differently, and different researchers may see things differently, and that’s how science works. But that shouldn’t make you shy away from sharing your data. In fact, just the opposite is true. Make sharing universal, develop clear standards that require both original research and replication be done transparently, make everyone (original researchers and replicators) share their data and code, and publish replications; then we’ll have a better idea which research actually holds up.