Out of the file drawer: Tips on prepping data for publication

Guest post: Julia Clark, Graduate Student Researcher, Policy Design and Evaluation Lab; PhD student in the Department of Political Science, UCSD

 

 

 


Publishing data and code is crucial to ensuring that social science research is transparent and reproducible. But researchers are busy people. Once a manuscript is accepted, the task of cleaning, annotating, and debugging project files for public consumption often moves quickly to the bottom of the priority list. This lack of preparation can result in replication files that are illegible (for humans or computers), incomplete, or unable to fully reproduce the published results. Addressing this all-too-common problem requires identifying good practices for preparing public data files and then helping researchers integrate these practices into their normal research habits.

This year at UC San Diego’s Policy Design and Evaluation Lab (PDEL), we’ve been offering a service to help faculty do just that. With support from the University’s Integrated Digital Infrastructure program, RAs (like me) work with PIs to streamline code, anonymize personally identifying data, organize files, and replicate analyses (in the “pure” sense of the term, using original data and code). The end product is an easy-to-navigate folder with complete replication materials that can be submitted to journals, online repositories, or elsewhere.

As we’ve worked with different researchers over the past few months, we’ve honed our processes into a relatively efficient workflow, which is reproduced below. We hope that this serves as a useful resource for other researchers (or their RAs) who need to disseminate data and code for existing projects. Furthermore, we hope that creating awareness about these backend processes will encourage researchers to plan ahead as they begin new projects, reducing the time and resources needed to prepare replication files in the future. Questions and feedback are very welcome!

Goals

Social science projects vary enormously in the type and size of data they use, and in the nature of the analyses conducted. Furthermore, each PI or team of researchers has their own software preferences and organizational style. Given this diversity, files look very different from one project to the next, and there is no one-size-fits-all solution to preparing replication files. Still, any efforts to organize and disseminate replication data should attempt to ensure that:

  1. Files are complete. All of the data, code, and supplementary materials (e.g., codebooks) needed to generate and interpret results (tables, figures, etc.) are included and organized in an intuitive manner. Unnecessary ancillary files (e.g., old versions of code and data, etc.) should not be included.
  2. Personal data are protected. As we know from the IRB process, personally identifiable information (PII)—names, phone numbers, email, addresses, etc.—cannot be included in a public dataset. When possible, anonymization of PII should come sequentially before merging and cleaning so that the data and code for these processes can be shared publicly.
  3. Code is readable. Code should be streamlined and legible. Scripts that run analyses should be separate from those that merge and clean data, and documentation (or the script names themselves) should clearly indicate the order in which they should be run and for what purpose. Comments should be used to help the human reader understand what the researcher is doing. Code that generates the main results of the paper should be clearly identifiable, and not obscured by supplementary and exploratory analyses.
  4. Everything works. Code and data should reproduce the paper’s results without error. Here, it can be helpful to have someone new to the project prepare the files, as a fresh pair of eyes may be more likely to catch errors. Note that running the code on a different computer, operating system, or software version can sometimes catch—and sometimes create—problems with replication.

Workflow

At PDEL, our workflow is designed to meet the above goals, while allowing for flexibility based on the project’s unique data and code. Note that this process is based mostly on experiences working with project files stored locally or in Dropbox without version control software (the current setup used by a majority of our PIs). However, many of the following steps would be faster, easier, or unnecessary with platforms like Open Science Framework (OSF) or Git, and we encourage researchers to make the switch for future projects!

1. Set up a separate folder for the replication files.

Good replication files will contain all the materials that are necessary to reproduce the study results (including data merging, cleaning, and analysis) with few extraneous files. Rather than copying or cleaning out existing directories, we’ve found it best to create a new, clean folder, and then add to it only those files needed for replication. This helps preserve the original data and code and keeps extraneous material out. The folder can be organized in any way appropriate to the type and number of files you will have, but the structure should be clear and logical. See here for a downloadable template. [Note: if you’re using Dropbox, see here for more tips on sharing folders with RAs in a way that protects PII data.]

  • Create a new (empty!) replication folder (e.g., “RCT_replication_files”), within your project directory.
  • Create subfolders such as “/code”, “/data_clean”, “/data_raw”, “/output”, and “/extra”.
  • Add a “readme.txt” file, and as you go through the workflow below, document each file in the replication folder (ideally including its function and source), along with other info such as system and software requirements.
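
The folder setup above can also be scripted. What follows is a minimal Python sketch, not part of the PDEL template: the helper name create_replication_folder and the wording of the readme stub are assumptions made for illustration, while the folder names are the examples from the list above.

```python
from pathlib import Path

# Subfolder names follow the examples suggested above.
SUBFOLDERS = ["code", "data_clean", "data_raw", "output", "extra"]


def create_replication_folder(project_dir, name="RCT_replication_files"):
    """Create a new, empty replication folder with the suggested subfolders.

    Hypothetical helper for illustration. Raises FileExistsError if the
    folder already exists, so existing work is never overwritten.
    """
    root = Path(project_dir) / name
    root.mkdir(parents=True, exist_ok=False)
    for sub in SUBFOLDERS:
        (root / sub).mkdir()
    # Start the readme as a stub; document each file as it is added.
    (root / "readme.txt").write_text(
        "Replication files\n"
        "System and software requirements:\n\n"
        "Files (function and source):\n",
        encoding="utf-8",
    )
    return root
```

Refusing to reuse an existing folder is a deliberate choice here: it enforces the "new (empty!)" rule, so nothing from an old directory sneaks into the replication files.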

2. Initial replication

Identifying the source of any problems in the code is easiest if you do the replication iteratively, beginning with the original code and data—if you clean and restructure documents before replication, it’s hard to know if any errors come from the original code or from your edits. We find that it’s easiest to start with the final analysis and work your way backwards in the code through the cleaning and merging processes. In each case, the original code and data files are COPIED (not moved) into the replication folder. Absent a version control system, this is the best way to protect the original work.

  1. Check analysis:
    • Copy the original analysis script(s) into RCT_replication_files/code
    • Copy the dataset(s) used for analysis into RCT_replication_files/data_clean
    • Run code without making changes except for pointing the working directory to your new replication folders
    • Fix any bugs in the code and address any discrepancies with the paper’s results
  2. Check data merge/cleaning:
    • If they are separate from the analysis script, copy the original merge/cleaning script(s) into RCT_replication_files/code
    • Copy the dataset(s) used for merging/cleaning into RCT_replication_files/data_raw
    • Run code without making changes except for pointing the working directory to your new replication folders
    • Run the analysis file debugged above on the newly created data file
    • If you get different results than in step #1, there is a problem with the merging/cleaning code
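
The comparison in the last step is easier to trust if it is done by a script rather than by eye. Here is a minimal Python sketch, assuming (hypothetically) that each run of the analysis exports its estimates to a CSV file with columns named "term" and "estimate"; those column names, the function names, and the tolerance are illustrative choices, not part of our workflow.

```python
import csv
import math


def load_estimates(path):
    """Read a CSV with columns 'term' and 'estimate' into a dict."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["term"]: float(row["estimate"]) for row in csv.DictReader(f)}


def compare_estimates(original, replicated, rel_tol=1e-6, abs_tol=1e-12):
    """Return (term, original, replicated) tuples for every mismatch.

    A term missing from one side is reported with None in its place.
    """
    problems = []
    for term in sorted(set(original) | set(replicated)):
        a, b = original.get(term), replicated.get(term)
        if a is None or b is None or not math.isclose(
            a, b, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            problems.append((term, a, b))
    return problems
```

An empty list means the two runs agree within the tolerance. A small tolerance, rather than exact equality, allows for floating-point differences across machines and software versions.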

3. Clean and curate:

Once the original code has been cleaned and debugged, it’s time to improve legibility and organization. If you’ve developed your files with public consumption in mind from the beginning, this process should go quickly. Other resources—see here or here[1]—give a more thorough set of coding best practices, but basic steps include:

  • Anonymize data (if not already done). Note that the person doing the replication should be named on the IRB protocol if they are working with data containing PII.
    • Ensure that no PII is included in datasets that will be public, including name, email, phone number, etc.
    • Ensure that individuals are not identifiable based on a combination of other attributes (e.g., if you’re surveying teachers and there is only one female, third-grade teacher aged 50-59 at a particular school, then she is not anonymous in your data)
    • Move the anonymization process as early as feasible in the data merging/cleaning process, so that as much as possible of the data manipulation process can be made public
    • Even though the PII data cannot be shared, do include any code that manipulates this restricted data for transparency as long as the code itself doesn’t compromise anonymity (e.g., censor code that sets the seed for a random draw to generate new ID numbers and could be used to reverse anonymization)
  • Organize and format scripts:
    • Create separate scripts for analysis and merging/cleaning code
    • Move exploratory analyses, and any others not used in the paper, to the end of the analysis file. Preserving these is good for posterity, but they shouldn’t obscure the main results
    • Add headers, including the paper’s and authors’ names, the date and creator of the code, the input files that the code requires, and the output files that it generates (a header template is included in our replication folder template)
    • Set the working directory at the start of the code and use abbreviated file paths for the rest of the document so that a future user only has to change the file path in one location
    • Format scripts so they’re easily readable (e.g., indent code, standardize comment syntax)
  • Document and annotate code:
    • Clearly label code that generates the tables and figures that appear in the paper
    • Keep output commands for tables and figures in the paper and appendix, as long as they all go to the “output” folder you’ve created, and comment out output commands for tables and figures not used in the paper
    • Give output objects sensible names like “table1”
    • Add comments when needed to improve reader understanding; remove comments that are unhelpful (or embarrassing!)
    • Label variables and values in Stata
  • Document folder contents:
    • Include codebook where necessary
    • Update the readme file as needed
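
The anonymization check on combinations of attributes (the lone female third-grade teacher aged 50-59) can be partly automated by counting how many respondents share each combination. The Python sketch below is illustrative only: the function name, the column names, and the three records are made up, and passing this check does not by itself guarantee anonymity.

```python
from collections import Counter


def find_identifiable_rows(rows, quasi_identifiers, k=2):
    """Return indices of rows whose attribute combination is shared by
    fewer than k respondents (k=2 flags combinations that are unique)."""

    def key(row):
        return tuple(row[col] for col in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [i for i, row in enumerate(rows) if counts[key(row)] < k]


# Made-up records for illustration only.
teachers = [
    {"school": "A", "gender": "F", "grade": 3, "age_band": "50-59"},
    {"school": "A", "gender": "M", "grade": 3, "age_band": "30-39"},
    {"school": "A", "gender": "M", "grade": 3, "age_band": "30-39"},
]
flagged = find_identifiable_rows(
    teachers, ["school", "gender", "grade", "age_band"]
)
print(flagged)  # [0]: the first teacher is the only one with her combination
```

Flagged rows are candidates for coarsening (e.g., wider age bands) or for dropping an attribute before the data are made public.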

4. Final replication:

Now that you’ve cleaned and reorganized script files, rerun the entire process—including data merging, cleaning and analysis—to make sure the results are consistent. Once discrepancies are addressed, the files are ready to send!
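
One way to rerun the entire process in a single pass is a small driver script that calls each script in order and stops at the first failure. The following is a Python sketch under stated assumptions: the function name run_pipeline is hypothetical, and the script names in the commented usage are placeholders for whatever commands run your own Stata or R files.

```python
import subprocess


def run_pipeline(steps, workdir="."):
    """Run each (label, command) pair in order from workdir.

    Stops at the first failing step by raising CalledProcessError, so a
    broken merge is never silently followed by the analysis.
    """
    completed = []
    for label, command in steps:
        print(f"Running {label} ...")
        subprocess.run(command, cwd=workdir, check=True)
        completed.append(label)
    return completed


# Hypothetical usage: the script names below are placeholders.
# run_pipeline(
#     [
#         ("merge and clean", ["Rscript", "code/01_merge_clean.R"]),
#         ("analysis", ["Rscript", "code/02_analysis.R"]),
#     ],
#     workdir="RCT_replication_files",
# )
```

Listing the steps in one place also documents the order in which the scripts should be run, which is worth repeating in the readme.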

[1] See also J. Scott Long. 2008. The Workflow of Data Analysis Using Stata, and Christopher Gandrud. 2013. Reproducible Research with R and RStudio.


Do you have your own experience with preparing data and code for publication? Share your thoughts in the comments below.

BITSS encourages submission of guest blog posts. If you’re working in the field of research transparency, open science, and reproducibility and have something to share, please contact garret -at- berkeley -dot- edu.
