Garret Christensen, BITSS Project Scientist
Several great working papers on transparency and replication in economics have been released in the last few months. Two of them, both about pre-analysis plans (PAPs), are intended for a symposium in The Journal of Economic Perspectives, to which I am very much looking forward. The first of these, by Muriel Niederle and Lucas Coffman, doesn’t pull any punches with its title: “Pre-Analysis Plans are not the Solution, Replication Might Be.” Niederle and Coffman claim that PAPs don’t decrease the rate of false positives enough to be worth the effort, and that replication may be a better way to get at the truth. Some of their skepticism stems from doubts about the assumption “[t]hat one published paper is the result of one pre-registered hypothesis, and that one pre-registered hypothesis corresponds to one experimental protocol. Neither can be guaranteed.” They’re also not crazy about design-based publications (or “registered reports”). Instead, they offer a proposal to get replication to take off: establish a Journal of Replication Studies, and have researchers cite replications, both positive and negative, whenever they cite an original work. They claim that if these changes were made, researchers might begin to expect to see replications, and thus the value of writing and publishing them would increase.
Another working paper on PAPs in economics, titled simply “Pre-Analysis Plans in Economics,” was released recently by Ben Olken. Olken gives a lot of useful background on the origin of PAPs and discusses in detail what should go into them. A reference I found particularly informative is “E9 Statistical Principles for Clinical Trials,” the FDA’s official guidance for trials, especially section V on Data Analysis Considerations. Obviously a lot of the transparency practices we’re trying to adopt in economics and social sciences come from medicine, so it’s nice to see the original source. Olken weighs the benefits (increasing confidence in results, making full use of statistical power, and improving relationships with partners, meaning governments or corporations that may have vested interests in the outcomes of trials) against the costs (complexity and the challenge of writing all possible papers in advance, a push towards simpler, less interesting papers with less nuance, and a reduced ability to learn ex post about your data). He cites Brodeur et al. to say the problem of false positives isn’t that large, and that, with the exception of trials involving parties with vested interests, the costs outweigh the benefits.
We at BITSS love Niederle and Coffman’s bit about replication, but I think we’re more positive about PAPs than either of the above working papers. I need to re-read the papers more closely, but there is one thing that everybody seems to agree on: when you’re doing a big, expensive field trial, and there’s a government or corporation or somebody else with a vested interest involved, a PAP is a great idea. For more discussion, read Berk Ozler’s great WB blog post about Humphreys, de la Sierra and van der Windt’s 2013 paper on PAPs. Or, assuming our session at the AEA meeting gets accepted, come listen to us discuss it with some of these folks in person.
Lastly, there’s a new working paper on replication from Michael Clemens at CGD, and an associated blog post. Clemens says that a lot of the debates in economics about “failed replications” would evaporate if we all actually meant the same thing when we said the word “replication.” Such a taxonomy has been attempted before, most notably by Daniel Hamermesh (gated, NBER WP). Clemens’ suggestion is that we break things into two categories, each with two sub-categories: replication, with subgroups verification and reproduction, and robustness, with subgroups reanalysis and extension. Basically, a lot of work that has been called replication isn’t. If you run the same code on the same data as someone else, don’t get the published result, and find an objective error in the code, that’s a failed replication. If you run different code on the same data (reanalysis) or run the same code on a different population (extension) and find that significance goes away, that’s something about which reasonable people might disagree, and not a failed replication.
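For concreteness, here is a minimal Python sketch of that two-by-two taxonomy as I read it. The names `Sample` and `classify` are made up for illustration and come from neither Clemens nor any library. The rule for reproduction (same code, new sample from the same population) is my assumption, since the summary above only spells out the other three cases. Combinations the summary doesn’t describe raise an error rather than guess.

```python
from enum import Enum
from typing import Tuple


class Sample(Enum):
    """Which data a follow-up study uses, relative to the original study."""
    SAME_DATA = "same data"
    NEW_SAMPLE_SAME_POPULATION = "new sample, same population"
    DIFFERENT_POPULATION = "different population"


# (same_code, sample) -> (category, sub-category)
_TAXONOMY = {
    (True, Sample.SAME_DATA): ("replication", "verification"),
    # Assumption: this case is not spelled out in the summary above.
    (True, Sample.NEW_SAMPLE_SAME_POPULATION): ("replication", "reproduction"),
    (False, Sample.SAME_DATA): ("robustness", "reanalysis"),
    (True, Sample.DIFFERENT_POPULATION): ("robustness", "extension"),
}


def classify(same_code: bool, sample: Sample) -> Tuple[str, str]:
    """Return (category, sub-category) for a follow-up study."""
    try:
        return _TAXONOMY[(same_code, sample)]
    except KeyError:
        # e.g. different code AND a different population: not covered here.
        raise ValueError("combination not covered by this sketch") from None


if __name__ == "__main__":
    print(classify(True, Sample.SAME_DATA))
    print(classify(False, Sample.SAME_DATA))
    print(classify(True, Sample.DIFFERENT_POPULATION))
```

Under this reading, only the first two rows can produce a “failed replication”; a disappearing result in the last two rows is a statement about robustness.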
My first reaction is that I whole-heartedly endorse the attempt at clarity, because there is absolutely a lot of contradictory use of these terms. I do have a concern about how well these terms will translate to other fields. If a psychologist redoes the same experiment with 100 college kids that someone else did with a different sample of college kids, that feels less like a robustness/extension test to me, and more like replication. (But maybe that’s just because that’s what I’m used to hearing psychologists call it.) Clemens cites the example of bogus cold fusion as a failed replication, specifically a failed reproduction test, since the researchers who showed that cold fusion was bogus weren’t using the same atoms of palladium as the original researchers, so technically that wasn’t identical actions with identical materials. I guess that’s not to say that I disagree with Clemens’ proposed taxonomy; I just think it may be an uphill (but worthy) battle, and the terms might be viewed differently in different disciplines.
