By Garret Christensen (BITSS)
What are the tools you use to make your research more transparent and reproducible? A lot of my time at BITSS has been spent working on a manual of best practices, and that has required me to familiarize myself with computing tools and resources that make transparent work easier. I’ll be sharing a draft of the manual at the BITSS Annual Meeting, but for now here are a few of the resources I’ve found most useful. If you’d like to learn more about these tools, there are a ton of helpful resources on the respective websites, or for a hands-on learning experience you can sign up for a collaborative training (December 11, 9.00 AM – 12.00 PM) BITSS is organizing with the D-Lab.
- J Scott Long, The Workflow of Data Analysis Using Stata
This book is one of the few things I’ve found written about how to manage the nitty-gritty details of your workflow. I assume that most quantitative social scientists learned how to organize their data analysis the same way I did: by working on a project as an RA for my adviser in grad school and combining that with the little I remembered from AP Computer Science in high school. But how well was that thought out? (Full disclosure, my adviser was Ted Miguel, one of the founders of BITSS, so hopefully I picked up some good habits.) Though this book is written for Stata users, I think it’s worth looking at for users of any statistical package, and useful whether or not you already have your own well-designed system in place for managing your workflow. There are a ton of useful general tips: if you change a file at all after distributing or posting it, you must give the file a new name. Never name a file “final,” because it won’t be. Name variables more informatively: “female” instead of “gender.” And I’ve used Stata for 10+ years, but I still learned several new Stata-specific things: different missing values “.a”-“.z”, the “notes” command, and the “datasignature” command.
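To make the idea behind a data signature concrete, here is a minimal Python sketch. It is not Stata's “datasignature” algorithm, just a generic file checksum that serves a similar purpose: record a fingerprint when you distribute a file, then check later whether the contents have changed. The function name `file_signature` and the file `survey_v1.csv` are hypothetical and used only for illustration.

```python
import hashlib
import os
import tempfile


def file_signature(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as folder:
    data_path = os.path.join(folder, "survey_v1.csv")
    with open(data_path, "w", newline="") as handle:
        handle.write("id,female\n1,0\n2,1\n")
    posted = file_signature(data_path)  # record this when you distribute the file

    # Later, someone changes one value in the file without renaming it.
    with open(data_path, "w", newline="") as handle:
        handle.write("id,female\n1,1\n2,1\n")
    changed = file_signature(data_path) != posted
    print("file changed since posting:", changed)
```

If the signature no longer matches the one you recorded, the file has been altered and, following Long's rule, should have been given a new name.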
- RStudio/R Markdown
I’ve never seen data on it, but I think I’m like most other economists in that I do most of my work in Stata. Recently I’ve been learning a bit of R, and I think I see good reasons for using it. If you really want to do work that’s reproducible in the long term, open source software probably gives you a better chance at doing that. Even if your proprietary software is popular right now, researchers in 20 years might not be able to read your code or open your data if the program doesn’t stay popular and backwards-compatible, and researchers in developing countries, either now or in 20 years, might not be able to afford licenses costing thousands of dollars. R is free. Of course I’ll keep working in Stata as long as so many of my colleagues and collaborators do, but the next time I have to do something a little more mathematical that I would previously have done in Matlab, I’ll do it in R. Specifically, I’ll do it in RStudio, an integrated development environment for R that lets you see your data, code, and output all at once. R Markdown is a tool built into RStudio that helps you write easily readable and reproducible code: the code and the output are woven together into one nicely formatted document. If you want to learn a few basics of these tools, I recommend some of the Data Science Coursera courses by Caffo, Leek, and Peng from Johns Hopkins.
- Version control
The basic goal is to back up all previous versions of your work and be able to revert to any of them easily. I think that if you’re working by yourself on a small project, there are manageable workarounds that will be fine for a decent number of your projects. Working by myself, my system is to include the date in the name of every analysis file I save, as in “runregressions2014.11.25.do”. I try to put a note at the top of the file indicating what changes I made in that version, and I save important files in an appropriately named folder like “Original_Journal_Submission”. By saving all these different versions, I’m certain never to accidentally save over and erase anything I might need later. That’s gotten me this far in life, but it has problems. What if I have a master program file that calls dozens of other smaller subroutine program files? Any time I update one of the subroutines and change the date in its file name, I have to change the master program file and update how it calls the subroutine. This can be a major pain, and it can be even worse if you’re collaborating with other researchers. Having shared file storage on Dropbox, Box, or Google Drive can help, but a real version control system like Git might be the best answer. (Git is free software that can run on almost any system; GitHub is a website that helps you use Git and stores your files. There are other version control systems similar to Git, such as Mercurial, but the differences are beyond my level of expertise.) With Git, you (or your collaborators) pull the latest files from a central repository, work on them on your own machine, save them without changing filenames (any time you save a change that you really want to add to the master version, you “commit” the changed files), and push them back to the central repository. Whoever is in charge of the central repository can review the changes and decide whether or not to adopt them into the master version.
Basically, it’s magical, and I’m a big fan. It might not be ideal for a proprietary software document like a Word .docx file, but it will show you a nice comparison of all the changes made in any text/ASCII file. I get the sense that basically all programmers use a system like this for collaborative coding, and social scientists could easily use it for writing their SAS/Stata/SPSS/R/Matlab/LaTeX/Markdown/HTML/whatever code. If you and your collaborators have access to the same network, you can just use Git on your own servers. If you need storage space, GitHub provides free storage as long as you make your work public; you can pay for private repositories. They also have great materials to get you started. So does Software Carpentry.
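For a sense of what that comparison of changes looks like, here is a minimal Python sketch using the standard library's `difflib` module. This is not Git itself, only an illustration of the kind of line-by-line diff a version control system shows for text files. The two versions of the do-file and the labels passed to `fromfile` and `tofile` are made up for the example.

```python
import difflib

# Two hypothetical versions of the same Stata do-file, split into lines.
old_version = """use survey.dta, clear
regress income educ
""".splitlines(keepends=True)

new_version = """use survey.dta, clear
regress income educ female
""".splitlines(keepends=True)

# unified_diff yields header lines, then context lines (leading space),
# removed lines (leading "-"), and added lines (leading "+").
diff_lines = list(difflib.unified_diff(
    old_version,
    new_version,
    fromfile="runregressions.do (previous version)",
    tofile="runregressions.do (current version)",
))
print("".join(diff_lines))
```

Because the file keeps the same name across versions, a master file that calls it never has to be edited just because a subroutine changed.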
- Open Science Framework
The Open Science Framework (OSF) is the main infrastructure project of the Center for Open Science (COS). (Full disclosure: BITSS is an active partner with, and has received funding from, COS.) The OSF is a free project management tool that aims to span the entire workflow, from beginning (planning) to middle (analysis) to end (submission, archiving). The OSF organizes your work by project, then allows you a good deal of flexibility and control over those projects. You can upload files, collaborate with other researchers, make your work public or private, anonymize your work for submission for blind peer review, edit wiki-like summaries of projects and parts of projects, fork an existing project to go off in another direction, or connect to several other research management tools, such as GitHub, Dataverse, Dropbox, and FigShare. OSF has built-in version control to help with all this file management. OSF doesn’t intend to replace any of these tools in particular, but I think it’s a powerful improvement to have a single hub that connects to all of them.
I think one of the most useful features so far in the OSF is the ability to easily create registrations of your work. Say you want to explicitly state your hypotheses before you begin your data collection. Write your hypotheses, upload them to OSF, hit the Registrations tab, and give an optional description. The OSF will then create a permanent, frozen, time-stamped version of your project with a persistent URL. This is useful not only as proof that one part of a project was done before another, but also for keeping track of major versions of projects. Create a registration when you submit a paper to a journal, then easily go back to it when you get the revise and resubmit.
The OSF is in beta, and they are working on adding additional tools into the system. Send an e-mail to firstname.lastname@example.org if there’s another research management tool you use that you’d like them to consider or if you have any other comments or suggestions.
- Dataverse
The Dataverse is a data repository at Harvard’s Institute for Quantitative Social Science. Normally you might post your code and data on your own website, in a single proprietary format, on a site that may disappear when you switch jobs and that other researchers might struggle to find. Instead, you can post your data to a trusted repository that will outlive you, that collects data from numerous similar projects by other researchers under the same searchable roof, and that automatically makes your data available in several formats: Stata, R, S-PLUS, and tab-delimited. See Esther Duflo et al.’s project in India for one example. It’s very easy to get the data from that project in any format, or to get the data from other Poverty Action Lab projects.
I’m less familiar with them, but there are other similar resources if you’re deciding which repository to use. Maybe try openICPSR, or check the repository of repositories (OK, directory of repositories) to see which repository is best for data from your field.
So those are five tools I’ve recently begun using to manage my research. BITSS would love to hear your comments on the tools above or suggestions for other tools we should look into. Leave a comment below or send @UCBITSS a message on Twitter.