Guest post by Arnaud Vaganay (Meta-Lab)
This post is the first of two dedicated to the reproducible interpretation of empirical results in the social sciences. Read part 2 here.
If you are a regular reader of this blog, chances are high that you know all about the ‘reproducibility crisis’ that has struck many fields of science over the past few years. In my experience, there is still a lot of confusion as to what it really means. In its narrowest sense, the reproducibility crisis refers to the inability, or great difficulty, that many researchers face when they attempt to reproduce a graph, a table or a single statistic using the same data and code as in the original study. A broader definition of irreproducibility refers to the difficulty of reproducing results using the same data as in the original study but the researcher’s own understanding of the analysis rather than the exact code. Under either definition, broad or narrow, this type of reproducibility is best called ‘analytic reproducibility’, as suggested by LeBel et al.
Although the analysis of causal mechanisms, correlations and trends is an essential part of research, it is only the first of two analyses that social scientists are expected to deliver. The second involves the interpretation of these results in the light of existing theories and results from previous studies. This is typically done in the ‘discussion’ section of the manuscript. I think it is safe to say that this latter type of analysis does not get nearly as much attention – from investigators, peer reviewers and readers – as the empirical analysis. As a result, discussion sections are often erratic and almost never reproducible. Rather than analysing the robustness of a theory or result against clear criteria, discussions tend to be used to justify the new findings based on whatever study will support the authors’ claims.

Such an approach is problematic for a number of reasons. Not only is it prone to confirmation bias, it also disregards the cumulative nature of science, i.e. the fact that “no single experiment, however significant in itself, can suffice for the experimental demonstration of any (natural) phenomenon” (words attributed to Ronald Fisher). In addition, it fails to properly manage the expectations of policymakers, beneficiaries and the media by neglecting to put these results in context. These groups need to understand that what worked ‘here’ may or may not work ‘there’, and if it doesn’t, the next logical question is why.
Assuming that each study involves a mix of replication and innovation, a useful discussion is thus one that:
- Compares and contrasts the new results with results from previous studies, bearing in mind that the closer the replication, the stronger the expectation to find a similar result; and
- Assesses the plausibility that any major discrepancy is due to the specificity (intended or unintended) of the intervention, context or analysis, rather than to errors or biases. This is the ‘innovative’ component of the study.
In line with the fundamental norms of science, this comparison should be transparent and systematic. Since the methodology of interpreting phenomena is called hermeneutics, it is appropriate to speak of ‘hermeneutic reproducibility’ to denote the extent to which a researcher agrees with the interpretation made by another researcher. Credit for the term goes to Victoria Stodden, who suggested it in a discussion about the different types of reproducibility.
There are a few more reasons why we should care about hermeneutic reproducibility. First, without a systematic comparison, researchers are left to discuss the meaning of their results in terms of direction (positive/negative) and statistical significance (significant/non-significant at a certain level). These measures can be helpful, but they are also crude and decontextualized. Second, without clear decision rules, researchers discussing the meaning of their results can easily fall victim to interpretive bias: failing to identify relevant previous studies, failing to compare the same quantities across relevant studies, or giving a different meaning to the same result. For example, two different visual presentations of the same result can lead to different interpretations. Ultimately, researchers prone to interpretive bias are likely to give greater weight to their preferred outcome. Last but not least, writing a reproducible discussion is a task that can be taught, delegated, quality controlled and improved.
My next blog post will provide some practical recommendations to enhance the hermeneutic reproducibility of empirical research. Stay tuned and feel free to send me some feedback at firstname.lastname@example.org!