Ensuring Reproducibility in Large Research Teams

Introduction from BITSS: Today on the BITSS blog, Thomas Brailey shares takeaways from his Catalyst training project, which involved onboarding members of the J-PAL Payments and Governance Research Program in reproducible workflows. Check out the training materials developed as part of the project and read on to learn more!

Holding all else equal, ensuring a reproducible and transparent research pipeline is more straightforward with fewer team members. When we discuss achieving reproducible social science in the abstract, there are four broad steps that need clear documentation: 1) obtaining the data; 2) cleaning and wrangling the data; 3) analyzing and visualizing the data; and 4) archiving or releasing the data to the public. With a few principal investigators and research assistants to collect and work on the data, this process has been, in my experience, relatively straightforward. However, ensuring a reproducible workflow becomes markedly more tricky when the project has many team members or is integrated into non-academic bodies such as non-profits or governments. Such organizations face an uphill battle in keeping to the ground rules of transparent and ethical research, especially if their partners do not emphasize the norms of transparent social science.

One might assume that whatever works for a small research team simply scales up for larger teams, but I would argue that far more care needs to be taken with the latter. This is because individual team members will have different levels of exposure to reproducible practices, different expectations of the research process, and different deliverables and responsibilities. Does non-analysis code (e.g., back-checks, logic checks, cleaning, and recoding code) need to be treated the same as analysis code, even though it won’t be included in a manuscript’s replication package? Do policy reports or updates for government officials need to emphasize replicability, even if those sectors do not place the same emphasis on transparency as academia does? The answer to both questions, I believe, is an emphatic yes. With that said, there appears to be very little literature focusing on this particular aspect of reproducible social science, so I will discuss some concrete options for ensuring transparency in large research teams. (This guide offers a fantastic overview of the whole research pipeline for large teams but does not focus on the interplay between team members or the challenges faced by the team as a whole.)

First, it is important to ensure that all code is version controlled, irrespective of what it does or who it is for. The industry-standard version control software (at least in political science) is Git, most commonly hosted on GitHub, and there are plenty of useful guides for getting this set up. Broadly speaking, each project should be stored as a single repository, with separate folders for cleaning, analysis, and replication code. Each researcher should work on their own branch for a specific task, open a pull request, and then assign another RA to review the changes before merging them into the main branch. Beyond reproducibility, this method ensures accountability among researchers and allows teams to see every change made to code files since the start of the project (by contrast, Dropbox only allows version history tracking for 180 days). Datasets can be stored on GitHub, but it is not necessary to do this, given that there usually isn’t a reason to overwrite a raw dataset. There also exist several trusted data storage sites that guarantee permanence and catalog stored data. Documents (.docx, .tex, .pdf, etc.) can be stored on GitHub and version controlled, but it is not considered industry standard to do so. A bifurcated system where all code is version controlled and non-code files are kept in shared storage space can work well for large research teams, though for simplicity, making all files reachable from GitHub (e.g., linked through the repository’s wiki page) might be helpful.
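As a rough illustration of the single-repository layout described above, the following Python sketch creates one folder each for cleaning, analysis, and replication code. The folder names, the function name `scaffold_project`, and the placeholder README files are assumptions made for this example, not a convention prescribed by the training materials.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names: the post only says that cleaning, analysis,
# and replication code should live in separate folders.
CODE_FOLDERS = ("cleaning", "analysis", "replication")


def scaffold_project(root: Path) -> list[Path]:
    """Create one sub-folder per stage of the pipeline under `root`.

    Each folder gets a placeholder README because Git does not track
    empty directories, so the layout would otherwise be lost on commit.
    """
    created = []
    for name in CODE_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():
            readme.write_text(f"# {name}\n\nDescribe the scripts in this folder.\n")
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_project(Path(tmp) / "example-project"):
            print(folder.name, sorted(p.name for p in folder.iterdir()))
```

In practice a team would run something like this once inside a freshly initialized repository and commit the result, so that every new project starts from the same structure.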

Second, within this reproducible framework, it is important to ensure that cleaning and analysis code is kept parsimonious and well-documented. The findings that you publish and present to governments may well be replicable, but if they are based on bad analysis, then they are meaningless. A 2015 PNAS article suggests that the best way to prevent replicable but poor analysis is to “increase the number of trained data analysts in the scientific community and […] identify statistical software and tools that can be shown to improve reproducibility and replicability of studies”. Having a well-documented standard for data analysis and data visualization that is uniform across the organization helps prevent mistakes and misleading results.

Third, large research teams should encourage non-academic entities with whom they interact to publish codebooks and thorough documentation accompanying any data that they share. Even if these data are not to be shared with the broader public, it is important for the research team to know exactly how the data were generated. It is exciting to see organizations such as J-PAL focus on bridging the gap between their survey experiments and the administrative data they use for analysis. J-PAL’s Innovations in Data and Experiments for Action (IDEA) Initiative “supports governments, firms, and non-profit organizations […] who want to make their administrative data accessible in a safe and ethical way”. With a survey, the research team has full control over the instrument and knows exactly how each variable is generated, but it is just as important to verify the validity of any external data used for analysis because bad data, like the bad analysis practices discussed above, cause misleading results.
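One lightweight way to act on this is to check incoming data against the codebook that accompanies it. The sketch below is a minimal example using only the Python standard library. It assumes, hypothetically, that both the dataset and the codebook arrive as CSV files and that the codebook has a `variable` column. The toy variable names are invented for illustration.

```python
import csv
import io


def check_against_codebook(data_file, codebook_file):
    """Compare a dataset's column names with the variables a codebook documents.

    Both arguments are open text files in CSV format. The codebook is assumed
    to list one variable per row in a column named `variable`.
    Returns (undocumented, missing): columns present in the data but absent
    from the codebook, and variables documented but absent from the data.
    """
    data_columns = next(csv.reader(data_file))  # header row only
    documented = [row["variable"] for row in csv.DictReader(codebook_file)]
    undocumented = sorted(set(data_columns) - set(documented))
    missing = sorted(set(documented) - set(data_columns))
    return undocumented, missing


if __name__ == "__main__":
    # Toy in-memory files standing in for a partner's data and codebook.
    data = io.StringIO("hh_id,district,payment_amt\n1,North,250\n")
    codebook = io.StringIO(
        "variable,description\n"
        "hh_id,Household identifier\n"
        "district,District of residence\n"
        "payment_date,Date of payment\n"
    )
    undocumented, missing = check_against_codebook(data, codebook)
    print("Undocumented columns:", undocumented)
    print("Documented but missing:", missing)
```

A check like this only confirms that every variable is documented. It says nothing about how the values were generated, which still has to come from the partner's own documentation.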

Fourth, it can be very helpful to have at least one team member, or an outside consultant, who stays up to date on the latest reproducible science practices and who can monitor the codebase and train other team members. This helps ensure that all researchers working with code and data can easily collaborate in a single repository. It is vital that all team members, even those who are not in direct contact with code and data, are aware of the importance of reproducible best practices and have exposure to the version control software that their team uses.

In this post, I have outlined some of the challenges faced by large research teams with regard to ensuring transparency throughout their research pipeline. I have also pointed to a few potentially useful practices that can help these diverse and complex organizations adhere to the tenets of reproducible social science. For those who are interested, all of our team’s onboarding materials can be found in our dedicated Open Science Framework repository. Want to share your experience and helpful resources for collaborating in large teams? Get in touch!

Thomas Brailey is a research associate with the Payments and Governance Research Program at UC San Diego. He received a B.S. in political science and data analytics in 2020 and will begin his MPhil in Comparative Government at the University of Oxford in 2022. He can be contacted via his LinkedIn page.
