By Temina Madon (CEGA, UC Berkeley)
Experimentation in the natural sciences far predates the use of randomized controlled trials (RCTs) in medicine and the social sciences; some of the earliest controlled experiments were conducted in the 1920s by R.A. Fisher, a statistician at an agricultural research station who evaluated new crop varieties across randomly assigned plots of land. Today, controlled experiments are the mainstay of research in plant, microbial, cellular and molecular biology. So what can we learn from efforts to improve research transparency in the natural sciences? And given that software engineers are now embracing experimental methods, are there lessons to be learned from computer science?
In the life sciences, advances in molecular biology and the genomics revolution—coupled with improvements in robotics and process automation—have enabled massively high-throughput data collection. Biologists can now quantitatively measure the expression of tens of thousands of genes (i.e. dependent variables) across hundreds or thousands of samples, under multiple experimental conditions. As a result, the volume of data has expanded exponentially in just a few years. And while data sharing has also improved, there is concern about the reproducibility and validity of results (particularly when hypotheses are not well defined in advance… which can quickly turn experiments into fishing expeditions). How are scientists addressing this issue?
The scientific community has identified several social norms that erode the integrity and transparency of quantitative ‘omic research. One issue is that the ‘methods’ sections of articles describing empirical studies have steadily contracted over time, providing ever fewer details of study design. Today’s publications provide little guidance for scientists wanting to re-run another researcher’s study. To combat this problem, a group of biologists has begun promoting minimum reporting guidelines for several classes of experiments. An analog for the social sciences might be a voluntary checklist for the reporting of intervention protocols, survey instruments, and methods (or scripts) for data cleaning, filtering, and analysis. This would include the disclosure of all tests of a given hypothesis, not just the ones that yield “interesting” results.
‘Omics’ researchers are also considering rewards for investigators whose experimental data are made public and then validated and used by independent research groups. This could incentivize experimental researchers to openly share protocols, data, and algorithms in user-friendly formats; it could also drive improvements in the comparability of studies, by promoting consistent measurement of common variables. (Similar incentives may be required to improve data sharing in the social sciences, according to statistician and political scientist Andrew Gelman.) The same group of biologists also recommends targeted support for the replication of promising discoveries that are slated for adoption by governments or other influential decision-makers. Before evidence is translated into action, there should be an extra “gut check”: a replication study outsourced to an independent research group.
This leads us to a second barrier to scientific integrity: top research funders have failed to invest in replicating and reproducing scientific experiments. There is a parallel failure of top academic journals to publish follow-on research. As a result, initial (first-in-class) findings often go unchallenged, and new research pushes forward on the shoulders of unconfirmed results. In response, some genetics journals are publishing special issues dedicated to the non-replication of original findings; others have committed to publish “non-results” just as they do studies with positive outcomes.
Experimental psychologists have responded to the replication problem by creating a FileDrawer for “non-results,” i.e. independent studies that fail to replicate previously published findings. But without vigorous peer review, how do we ensure that non-replications are conducted with rigor, and without introducing additional bias? A special issue on replication was recently published in Perspectives on Psychological Science, and this could be a useful next step for other disciplines in the social sciences.
Replication isn’t simply a matter of academic integrity. In cancer research, there has been a public call to replicate the findings of pre-clinical studies—particularly those reported by industry researchers—in order to drive down the costs of drug development, since replication can help detect and eliminate failures earlier in the drug development pipeline. This could have the added benefit of increasing the power to detect promising drug candidates (i.e. when the results of small, first-in-class experimental trials are pooled with the results from direct replications).
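The pooling idea above can be illustrated with a simple fixed-effect (inverse-variance) meta-analysis, the standard way to combine an original trial with a direct replication. This is a minimal sketch; the effect sizes and standard errors are hypothetical, chosen only to show how pooling shrinks the standard error (and thus raises power):

```python
# Fixed-effect (inverse-variance) meta-analysis: pooling an original
# trial with a direct replication. All numbers are hypothetical.

def pool_fixed_effect(effects, std_errors):
    """Return the pooled effect estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Hypothetical original trial and its direct replication (same outcome scale)
effects = [0.40, 0.25]      # estimated treatment effects
std_errors = [0.20, 0.15]   # standard errors of each estimate

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}")
```

Note that the pooled standard error (0.12) is smaller than either study's alone, which is precisely the power gain the pre-clinical literature is after.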
Similarly, biomedical researchers have called for the replication of specific genome-wide association studies, to prevent promising discoveries from moving into the clinic without appropriate validation. Toward this end, they have now defined criteria for replication of initial studies, to ensure that investments in follow-on research are warranted. Replication need not be a condition for further research, but it can be a tool for ensuring that evidence from observational (or even quasi-experimental) research is validated before being translated into public health practice.
Could criteria and norms for the replication of field trials someday become part of the public policy-makers’ dialogue?
Questions remain about whether trial replication is feasible in the social sciences. Field studies, like clinical trials, are costly; and there has been little political willingness to invest in repeating others’ efforts. However, genomic researchers once made these same arguments, and the imperative to improve the reproducibility of results actually drove down the costs of experimentation. Similarly, an increased focus on replication of field experiments (coupled with advances in information and communication technologies) could reduce the costs of empirical research in the social sciences.
Of course not all replications yield the same information. In software engineering, there has been a recent effort to document and refine the various typologies of scientific replication and reproduction. Psychologists like Hal Pashler have also explored the variety of replications that can be performed by experimentalists. A possible typology for the empirical social sciences, building on others’ work, might include:
Direct (or Strict) Replications: These use the same experimental design and conditions as the original study, including protocols, instruments, and study populations. They can be carried out by the same experimenter, or by independent groups (to reduce researcher bias).
Conceptual Replication: In this approach, investigators implement the same overall experimental design (i.e. testing the same hypothesis, and focusing on the same outcomes of interest) but with variation in protocols and/or measurement tools.
Analytic Replication: This is the re-analysis of data from an existing experiment (sometimes called pseudo-replication). Analytic replications require the release of a researcher’s original data files to an outside group for independent statistical analysis. In principle, the data can be blinded to reduce researcher bias.
Reproduction: An experiment that is “reproduced” requires an investigator to come up with an entirely new experimental design (i.e. novel protocols, measurement instruments, study populations, and environment/context) in order to observe a phenomenon that has been previously identified and reported. This validates initial findings and can potentially contribute to the external validity of a finding.
Empirical Generalization: This is the replication of an existing experiment using the same protocol and design, but implementing the study in a new population.
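Of these categories, analytic replication is the most mechanical, and a toy sketch may help fix ideas. Assuming the original team releases a de-identified dataset with a treatment indicator and an outcome (the records below are hypothetical stand-ins for such a file), an outside group would independently re-compute the reported estimate:

```python
# Analytic replication sketch: independently re-estimate a treatment
# effect from a released dataset. These records stand in for a
# de-identified data file shared by the original research team.

records = [
    {"treated": 1, "outcome": 5.1}, {"treated": 1, "outcome": 4.7},
    {"treated": 1, "outcome": 5.4}, {"treated": 0, "outcome": 4.2},
    {"treated": 0, "outcome": 3.9}, {"treated": 0, "outcome": 4.4},
]

def difference_in_means(rows):
    """Re-compute the simple treatment/control difference in means."""
    treated = [r["outcome"] for r in rows if r["treated"] == 1]
    control = [r["outcome"] for r in rows if r["treated"] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(f"re-estimated effect: {difference_in_means(records):.2f}")
```

The independent analyst would then compare this re-estimated effect against the published figure; blinding the treatment labels before re-analysis, as noted above, would further guard against bias.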
An interesting twist on replication is “critical assessment,” an approach from computational biology that is feasible when the results of an experiment have not yet been publicly released. This often takes the form of a competition, with independent research groups invited to design models that predict the outcomes of the experiment. Baseline data are made available to researchers, and the experimental outcomes are withheld for the judging process. The model that comes closest to predicting actual results is awarded a prize. This approach could theoretically be adapted for field trials in the social sciences.
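The judging step of a critical assessment can be sketched very simply: organizers hold back the actual experimental outcomes, collect each team's predictions, and score them with an agreed error metric. The team names, numbers, and choice of root-mean-squared error below are all hypothetical, for illustration only:

```python
# Critical-assessment judging sketch: independent teams submit predicted
# outcomes; organizers score them against withheld actual outcomes.
# Team names, values, and the RMSE metric are hypothetical choices.

import math

def rmse(predicted, actual):
    """Root-mean-squared error between predictions and withheld outcomes."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

withheld_outcomes = [1.0, 2.0, 3.0, 4.0]  # held back until judging

submissions = {
    "team_a": [1.1, 2.2, 2.9, 3.8],
    "team_b": [0.5, 2.5, 3.5, 4.5],
}

scores = {team: rmse(preds, withheld_outcomes) for team, preds in submissions.items()}
winner = min(scores, key=scores.get)
print(f"winner: {winner}")
```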
Why does all of this matter? Does scientific integrity have an impact beyond the ivory tower? It is interesting that the U.S. Government’s Office of Management and Budget has recently issued guidelines and standards for the reliability of scientific data (particularly when applied to government decision-making).
An agency is directed, “to the degree that an Agency action is based on science,” to use “(i) the best available, peer-reviewed science and supporting studies conducted in accordance with sound and objective scientific practices; and (ii) data collected by accepted methods or best available methods (if the reliability of the method and the nature of the decision justifies use of the data).” We also note that the OMB guidelines call for an additional level of quality “in those situations involving influential scientific or statistical information”…
The Office further elaborates on when scientific or statistical information is “influential”:
“Influential” when used in the phrase “influential scientific or statistical information” means the agency expects that information in the form of analytical results will likely have an important effect on the development of domestic or international government or private sector policies or will likely have important consequences for specific technologies, substances, products or firms.
But even OMB fails to recognize the value of replicating empirical findings, taking peer review to be adequate in most cases:
If the results have been subject to formal, independent, external peer review, the information can generally be considered of acceptable objectivity.
In those situations involving influential scientific or statistical information, the results must be capable of being substantially reproduced, if the original or supporting data are independently analyzed using the same models. Reproducibility does not mean that the original or supporting data have to be capable of being replicated through new experiments, samples or tests.
Making the data and models publicly available will assist in determining whether analytical results are capable of being substantially reproduced. However, these guidelines do not alter the otherwise applicable standards and procedures for determining when and how information is disclosed. Thus, the objectivity standard does not override other compelling interests, such as privacy, trade secret, and other confidentiality protections.
In its guidance to federal agencies, OMB defers to the established standards of data sharing within research communities. Fortunately, as these blog posts demonstrate, there are now several community-led movements to increase transparency—in political science and economics, as well as clinical medicine and the biological sciences.
In the coming years, some of the existing techniques for improving integrity and transparency (including those developed by researchers in the 1960s) will diffuse to fields like economics and education… approaches like blinding, the use of both positive and negative controls, and trial registration. A few of these may be less amenable to experiments in the ‘real world,’ but others could be adopted without high costs.
Ten years from now, it will be interesting to look back on the innovations in research transparency introduced by social scientists. In the future, there may be a diffusion of norms from economics and political science to clinical research and the natural sciences.