Science is “show me,” not “trust me”

Guest post by Philip B. Stark, Associate Dean of the Division of Mathematical and Physical Sciences, UC Berkeley Professor of Statistics, and winner of one of BITSS’ Leamer-Rosenthal Prizes for Open Social Science.





Reproducibility and open science are about providing evidence that you are right, not just claiming that you are right. Here’s an attempt to distill the principles and practices.

Shortest: Show your work.

Next shortest: Show your work. All your work.

A bit longer: Provide enough information to check whether your claims are right: describe what you intended to do, and provide convincing evidence that you did what you intended.

Others should be able to check whether your tables and figures really do result from doing what you said you did, to the data you said you did it to.

That includes being able to tell whether the code matches the math and the math matches the verbal description.

None of these things proves that what you did was the right thing to do, only that you did it. But it gives others the opportunity to assess whether it was the right thing to do, and hence can provide evidence for or against your scientific conclusions.
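
As a small illustration of what "the code matches the math and the math matches the verbal description" can look like in practice, here is a minimal Python sketch. The function names and the data values are made up for this example and do not come from any real analysis. The verbal description is "the average squared deviation from the mean, dividing by n - 1"; the math is s² = Σ(xᵢ - x̄)² / (n - 1); and the code is written so a reader can line it up with both, then check it against an independent implementation.

```python
import math
import statistics


def sample_variance(xs):
    """Unbiased sample variance: sum((x - mean)^2) / (n - 1).

    Verbal description: the average squared deviation from the mean,
    dividing by n - 1 rather than n.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def check_against_reference(xs):
    """Compare the hand-written formula with the standard library's version."""
    return math.isclose(sample_variance(xs), statistics.variance(xs))


# Made-up illustrative values, not real data.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance(data))
print(check_against_reference(data))
```

Agreement with a reference implementation shows only that the code computes what the formula says, not that the formula was the right one to use.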


There are many reasons to work openly and reproducibly. My top reasons are:

  1. It gives others the opportunity to check whether my work is correct, and to correct it if not.
  2. It enables others[1] to re-use and extend my work easily.
  3. It makes analyses available as artifacts that can serve as data for the study of the practice of science.

All these tend to help science progress more rapidly.


It’s hard to give a blanket recipe for open and reproducible science across disciplines, but some things obviously are problems. Here’s an attempt at a diagnostic hierarchical checklist of reproducible practices. I find that looking carefully at my workflow wherever I fail one of these tests helps me improve.

  1. If you relied on Microsoft Excel for computations, fail.[2]
  2. If you did not script your analysis, including data cleaning and munging, fail.[3]
  3. If you did not document your code so that others can read and understand it, fail.[4]
  4. If you did not record and report the versions of the software you used (including library dependencies), fail.[5]
  5. If you did not write tests for your code, fail.[6]
  6. If you did not check the code coverage of your tests, fail.[7]
  7. If you used proprietary software that does not have an open-source equivalent without a really good reason, fail.[8]
  8. If you did not report all the analyses you tried (transformations, tests, selections of variables, models, etc.) before arriving at the one you chose to emphasize, fail.
  9. If you did not make your code (including tests) available, fail.[9]
  10. If you did not make your data available (and a law like FERPA or HIPAA doesn’t prevent it), fail.[10]
  11. If you did not record and report the data format, fail.[11]
  12. If there is no open source tool for reading data in that format, fail.
  13. If you did not provide an adequate data dictionary, fail.[12]
  14. If you published in a journal with a paywall and no open-access policy, fail.[13]
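
Item 4 in the checklist above (recording software versions) can be partly automated. The following is a minimal sketch using only the Python standard library. The function name `record_environment` is hypothetical, and `numpy` and `pandas` are placeholder package names; substitute whatever your analysis actually imports. It is one possible approach under those assumptions, not a substitute for a dedicated provenance tool.

```python
import json
import platform
from importlib import metadata


def record_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Placeholder package names: list whatever your analysis imports.
env = record_environment(["numpy", "pandas"])
print(json.dumps(env, indent=2, sort_keys=True))
```

Saving this output next to the results each time the analysis runs gives readers a record of the versions that produced them.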

If you have suggestions to improve this list, please tweet @philipbstark with hashtag #openScienceChecklist.


I think reproducibility and open science would make huge strides if everyone pledged:

A. I will not referee any article that does not contain enough information to tell whether it is correct.

If you are committed, add:

B. Nor will I submit any such article for publication.

And if you are brave, add:

C. Nor will I cite any such article published after 1/1/2017.

If you are willing to sign one or more of the pledges above, please tweet @philipbstark with hashtag #openSciencePledge, indicating which of A, B, C, you pledge to do, and the name to use for your signature. I’ll publish the results.

  1. For this purpose, “others” includes me, next week.
  2. In general, spreadsheets make it hard to work reproducibly and accurately. It’s hard to document and reconstruct what you clicked and in what order, and the user interface conflates input, output, code, and presentation, making testing code and discovering bugs difficult. For examples of spreadsheet horror stories, see The European Spreadsheet Risk Interest Group. But using a buggy spreadsheet application makes things even worse. Excel has had many severe bugs in the past, including bugs in addition, in multiplication, in random number generation, and in statistical routines, some of which persisted for several versions of the program. See, e.g.,
  3. This doesn’t mean you can’t use point-and-click tools to figure out what needs to be done, just that once you do figure it out, you should automate the process to document what you did so you (and others) can check the process and regenerate everything from scratch. A tool like might help.
  4. A good test is to look at your own code a month after you wrote it. If you can’t read it, neither can I.
  5. There are software tools to help with this. See, for instance, noWorkFlow.
  6. There are great tools to automate software testing (but not to create the tests, which requires knowing what the software is supposed to do!). Test your software every time you change it. This is a reason not to rely on spreadsheets for computing: it’s not straightforward to test spreadsheet calculations.
  7. There are also great tools to automate checking the coverage of tests. It is nice to report the coverage of your tests, although I’ve never seen a scientific publication that did.
  8. The following are not really good reasons: “My university has a site license.” “It’s the tool I learned in graduate school.” “It’s too much trouble to port my code.” At first blush, a good reason is that the particular algorithm you need is very complex, is not implemented in any open-source environment, and you lack the skill to re-implement it in an open-source fashion. But if that’s true, what evidence do you have that it’s implemented correctly in the proprietary tool? Essentially all software has bugs, and software that doesn’t let you look under the hood should be treated with particular suspicion. If you do use proprietary software that has an open-source equivalent, it’s nice to test that your code works in the open-source version, too, and to record and report the version that worked.
  9. Your code should also state how it’s licensed, for instance, a BSD license, the X11 (MIT) license, the Apache license, or a GNU GPL license, along with the version of the license that applies. Ideally, code should be published in a way that makes it easy for others to check it and to re-use and extend it, e.g., by publishing it using a tool/service like GitHub.
  10. I might go so far as to argue that results based on proprietary data should not be published in the scientific literature unless there is some way to provide convincing evidence that the results are correct without access to the data. This might require developing new techniques based on encryption or noisification, such as differential privacy.
  11. It is nice if the data are in a standard format.
  12. A data dictionary that says things like mnth_var = month variable is less than helpful. I’ve seen many data dictionaries that were useless.
  13. Allowing you to post the final version of your paper on a reprint server might be enough, but I think it’s time to move to open scientific publications. Most publishers I’ve worked with have let me mark up the copyright agreements to keep copyright myself and grant them a non-exclusive right to publish.
