Jennifer Sturdy, BITSS Program Advisor
Science, my lad, is made up of mistakes, but they are mistakes which it is useful to make, because they lead little by little to the truth.
It seems fitting that I read a book referencing this Jules Verne quote just before our BITSS panel on research transparency and reproducibility at the 68°North Conference on Behavioral Economics, hosted by The Choice Lab at the Norwegian School of Economics. Not only because, on our way to the conference location, it felt as though we had journeyed to the edge (not the center) of the Earth, but also because the theme that ran through the panel reminded us that the goal of our research is to scratch away at unknowns as we work toward new truths – and we will probably make some mistakes along the way.
Determining what is truth
Dan Benjamin (USC) began the panel with his presentation on “The Strength of Evidence from Statistical Significance and P-values.” The main takeaway: it’s no wonder we see failures to replicate when we consider that we have a “strength of evidence” problem. The argument is simple: we reject the null hypothesis at a p-value of 0.05 or lower and treat this as strong evidence of a true effect. However, a p-value of 0.05 corresponds to at most about 3:1 evidence in favor of a true effect – far weaker than most researchers realize. Moreover, we often ignore power and prior odds. If a study has low power, Gelman and Carlin (2014) argue, we can see statistically significant results that are misleading in direction and magnitude and are therefore unlikely to replicate. And if the prior odds are low, then even statistically significant findings are unlikely to be true effects. These are big problems. One way to start chipping away at them, as Dan and other thought leaders recently suggested, is to consider prior odds and lower the threshold for statistical significance from 0.05 to 0.005 to increase the strength of the evidence. Check out the preprint yourself, or its forthcoming publication in Nature Human Behaviour!
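To make the “weak evidence” point concrete, here is a minimal Python sketch using one widely cited upper bound on the evidence a p-value can carry – the Sellke–Bayarri–Berger bound, -1/(e·p·ln p). This is an assumption for illustration; the exact figures Dan cites may come from a related but different calculation, and they land in the same ballpark.

```python
import math

def bf_upper_bound(p):
    """Sellke-Bayarri-Berger bound: the Bayes factor in favor of a true
    effect implied by a p-value is at most -1 / (e * p * ln p), for p < 1/e."""
    return -1.0 / (math.e * p * math.log(p))

# p = 0.05 caps the evidence at roughly 2.5:1; p = 0.005 raises it to ~14:1.
print(f"p = 0.05  -> evidence at most {bf_upper_bound(0.05):.1f}:1")
print(f"p = 0.005 -> evidence at most {bf_upper_bound(0.005):.1f}:1")
```

Under this bound, moving the threshold from 0.05 to 0.005 strengthens the maximum evidence a “significant” result can carry by roughly a factor of five – the core of the redefine-significance proposal.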
Bottom line: We need to acknowledge that we have a “strength of evidence” problem. To address it, we need to start reporting – in pre-registration, pre-analysis plans (PAPs), and final write-ups – on prior odds, significance thresholds, anticipated effect sizes, and power calculations so any reader can understand the strength of the evidence. We should also challenge ourselves to think critically about what should be considered acceptable significance levels.
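To see why prior odds and power belong in the write-up, the standard arithmetic can be sketched in a few lines of Python. The specific prior and power values below are assumptions chosen purely for illustration:

```python
def false_positive_risk(prior, power, alpha):
    """Share of 'significant' findings that are actually false positives,
    given the prior probability a tested effect is real, the study's power,
    and the significance threshold."""
    true_hits = power * prior          # real effects correctly detected
    false_hits = alpha * (1 - prior)   # null effects crossing the threshold
    return false_hits / (true_hits + false_hits)

# Assumed numbers: 1-in-10 prior that the effect is real, 80% power.
print(f"{false_positive_risk(0.10, 0.80, 0.05):.3f}")   # 0.360
print(f"{false_positive_risk(0.10, 0.80, 0.005):.3f}")  # 0.053
```

With these (hypothetical) inputs, over a third of “significant” findings at p < 0.05 would be false positives; a reader can only judge this risk if the prior odds and power are reported alongside the p-value.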
Finding the way to the truth
Johannes Haushofer (Princeton) then delivered an excellent discussion on “Pre-Analysis Plans in Behavioral and Experimental Economics.” The slide deck provides a useful overview of what is expected in a PAP and of its benefits from the perspective of a Social Planner – a decision-maker who attempts to achieve the best result for all parties involved. He dispensed quickly with several arguments regarding the costs of writing PAPs, including that PAPs take too long and limit exploratory analysis. However, he also discussed one cost we rarely reflect on.
One (potentially) large cost of writing a PAP arises if the author includes an incorrect specification – a legitimate mistake in describing what would be tested and how. If an overzealous reviewer or editor doesn’t allow such a mistake to be corrected – well documented, of course! – there is a risk that the paper won’t be published, under the assumption that the author p-hacked to find the results. One certainly hopes we are allowed to correct mistakes and let the strength of the evidence speak for itself. Militant use of PAPs isn’t anyone’s goal (I don’t think). Instead, PAPs are supposed to lay the groundwork for why and how researchers approach their research questions. The discussion benefited greatly from audience members who raised the question: when and why do we use PAPs?
Bottom line: PAPs are meant to be the guide for how a researcher expects to find the truth – laying out the methods, hypotheses, etc. However, there is no one-size-fits-all approach here. What belongs in a PAP, and when the researcher develops one, depends on its purpose. If PAPs are only intended to prevent p-hacking, then writing them just before data analysis starts serves that purpose. If PAPs are also to help inform study design, then writing them sooner serves a purpose. The goal in all of these cases is the same – clearly document what is to be done and why, including corrections to mistakes, evolution in thinking, etc. – so we can work toward some new truth.
Ensuring it’s valuable truth
Matthew Rabin (Harvard) closed the panel with his discussion from an Implications-Oriented Economic Theorist’s Perspective. Here, he reminded the audience to consider how research outputs inform a production function: theorists supply restrictive and widely applicable formal models, while experiments supply the data that feed into tractable models with economic consequences and empirical implementability. Each informs the other, and strong theories benefit when their inputs are valuable, credible truths.
There were several great points throughout the discussion, but one highlight relates to replication. While we might be concerned about a lack of replication in some cases, let’s think about what is valuable. First, consider that there are likely file drawers full of “failed non-replications” that we can’t account for when we consider what replicates and what doesn’t. Second, consider the goal of replication: if researchers want to test the statistical truth of an original study, they must be very precise in how they follow the original study design and analysis – and be held accountable for pre-registration, pre-analysis plans, etc. in the same way as the authors of the original study. But consider what matters more for the value of those results: if results only hold in the exact original situation and go away or reverse in other contexts, how valuable a truth are they? What may matter more is whether results hold for inexact replications – meaning they replicate even when there are adjustments to context.
Bottom line: Even theorists need to stop “t-hacking” and start revealing unrealistic implications and applications of their theories that they have discovered, instead of just reporting the ones that put their theories in good light.
Researchers make mistakes on the way to learning new truths – overestimating the strength of the evidence (or theory), developing hypotheses that aren’t supported by data, incorrectly stating hypotheses, and incorrectly measuring outcomes, to name a few. Rather than hide those mistakes in file drawers, behind selective reporting, or in evolving specifications, we should be as open to sharing them as we are to sharing research results that support our original, or evolving, ideas of the truth.
This is why we see growing demand for and use of pre-analysis plans, pre-registration, and more transparent disclosure and reporting: mistakes also lead us to new learning, and as long as they are well documented and explained, it is for the community to determine the strength of the evidence, particularly with improved statistical and reporting requirements. As mentioned in Matthew’s presentation: let transparency facilitate community judgment.
In this spirit, The Choice Lab and Center for Empirical Labor Economics (CELE) are joining forces to establish a Norwegian Center of Excellence – Centre for Experimental Research on Fairness, Inequality, and Rationality (FAIR), which will partner with BITSS to establish principles and practices for transparent, reproducible research produced by the center. We are excited to be a part of this and look forward to many future opportunities for engaging discussions like this panel!