Targeted Learning from Data: Valid Statistical Inference Using Data Adaptive Methods

By Maya Petersen, Alan Hubbard, and Mark van der Laan (Public Health, UC Berkeley)

Statistics provide a powerful tool for learning about the world, in part because they allow us to quantify uncertainty and control how often we falsely reject null hypotheses. Pre-specified study designs, including analysis plans, ensure that we understand the full process, or “experiment”, that resulted in a study’s findings. Such understanding is essential for valid statistical inference.

The theoretical arguments in favor of pre-specified plans are clear. However, the practical challenges to implementing such plans can be formidable. It is often difficult, if not impossible, to generate a priori the full universe of interesting questions that a given study could be used to investigate. New research, external events, or data generated by the study itself may all suggest new hypotheses. Further, huge amounts of data are increasingly being generated outside the context of formal studies. Such data provide both a tremendous opportunity and a challenge to statistical inference.

Even when a hypothesis is pre-specified, pre-specifying an analysis plan to test the hypothesis is often challenging. For example, investigation of the effect of compliance with a randomly assigned intervention forces us to specify how we will contend with confounding. What identification strategy should we use? Which covariates should we adjust for? How should we adjust for them? The number of analytic decisions, and the impact of these decisions on conclusions, grows further when losses to follow-up, biased sampling, and missing data are considered.

Pre-specifying complex analytic decisions based on a priori specified parametric models runs the substantial risk that the models will be wrong, resulting in bias and misleading inference. Such an approach is nonetheless sometimes advocated as a lesser evil than unsupervised data mining. Fortunately, modern statistics provides other alternatives.

Nonparametric (or semiparametric) methods are based on the premise that models should represent what is truly known about the process generating the data. Once we accept this premise, it is immediately clear that methods capable of learning from data in a pre-specified way are essential. Recent advances, including targeted maximum likelihood estimation, make it possible to use data-adaptive (“machine learning”) techniques to draw valid statistical inferences by targeting an initial fit of the data towards the desired estimand/hypothesis of interest. These are more than just interesting theoretical developments: they are now implemented in R packages such as SuperLearner and tmle.
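
To make the idea of "targeting an initial fit" concrete, here is a minimal sketch in plain Python. It is not the interface of the tmle or SuperLearner packages. The simulated data, the function names (simulate, tmle_ate), and the choice of estimand (an average treatment effect with one binary confounder W) are all assumptions made for illustration. The initial outcome fit is deliberately crude and ignores W; in practice a data-adaptive learner would take its place.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate(n, seed=0):
    """Hypothetical data: binary confounder W, treatment A, outcome Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < (0.7 if w else 0.3))  # W drives treatment
        y = int(rng.random() < expit(-1.0 + 1.0 * a + 1.5 * w))
        rows.append((w, a, y))
    return rows


def tmle_ate(rows):
    """Return (estimate, standard error) for E[Y(1)] - E[Y(0)]."""
    n = len(rows)

    # Step 1: initial outcome fit Q(A). Deliberately crude: it ignores W,
    # so on its own it gives the confounded difference in means.
    def arm_mean(arm):
        ys_arm = [y for _, a, y in rows if a == arm]
        return sum(ys_arm) / len(ys_arm)
    q_init = {0: arm_mean(0), 1: arm_mean(1)}

    # Step 2: treatment mechanism g(W) = P(A = 1 | W), by stratum proportions.
    g = {}
    for stratum in (0, 1):
        treated = [a for w, a, _ in rows if w == stratum]
        g[stratum] = sum(treated) / len(treated)

    # Step 3: "clever covariate" for the average treatment effect.
    h = [a / g[w] - (1 - a) / (1 - g[w]) for w, a, _ in rows]
    offset = [logit(q_init[a]) for _, a, _ in rows]
    ys = [y for _, _, y in rows]

    # Step 4: targeting. Fit eps in logit Q*(A, W) = logit Q(A) + eps * H
    # by Newton's method on the one-dimensional logistic likelihood.
    eps = 0.0
    for _ in range(100):
        p = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (y - pi) for hi, y, pi in zip(h, ys, p))
        info = sum(hi * hi * pi * (1 - pi) for hi, pi in zip(h, p))
        step = score / info
        eps += step
        if abs(step) < 1e-12:
            break

    # Step 5: plug the updated fit into the estimand.
    q1 = [expit(logit(q_init[1]) + eps / g[w]) for w, _, _ in rows]
    q0 = [expit(logit(q_init[0]) - eps / (1 - g[w])) for w, _, _ in rows]
    ate = sum(u - v for u, v in zip(q1, q0)) / n

    # Step 6: standard error from the estimated influence curve.
    ic = [hi * (y - (u if a else v)) + u - v - ate
          for hi, (_, a, y), u, v in zip(h, rows, q1, q0)]
    se = math.sqrt(sum(val * val for val in ic) / n / n)
    return ate, se


rows = simulate(5000, seed=1)
estimate, se = tmle_ate(rows)
print(f"TMLE estimate: {estimate:.3f}  (95% CI half-width {1.96 * se:.3f})")
```

In this toy setting the unadjusted difference in means is biased because W affects both treatment and outcome. The targeting step uses the estimated treatment mechanism to pull the crude initial fit towards the estimand, and the influence curve supplies a standard error for inference.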

The process of generating a hypothesis should be distinguished from the analytic methods used to test it. Even when a hypothesis has not been pre-specified, validity of inference can be improved by using targeted data-adaptive methods for analysis. Rigor can be improved further by pre-specifying the approach by which data will be used to generate new hypotheses. Examples include recent work on adaptive trial designs.

We strongly support the interdisciplinary movement to increase transparency at all levels of the scientific process, including funding, study design, hypothesis generation, data analysis, and reporting. In order to meet this objective, statistical methods capable of providing valid inferences while allowing us to learn from data are needed. Happily, such methods are increasingly available.


About the authors:

Maya Petersen is an Assistant Professor of Biostatistics and Epidemiology at the University of California, Berkeley School of Public Health. Maya’s research focuses on the development and application of novel causal inference methods to problems in health, with a focus on the treatment and prevention of HIV.

Alan Hubbard is an Associate Professor of Biostatistics at UC Berkeley. Alan primarily works on the analysis of high dimensional data using semi-parametric statistical methods, with applications including prognostic factors in severe trauma patients, the molecular biology of aging, and diarrheal disease in developing countries.


Mark van der Laan is Professor of Biostatistics and Statistics at the University of California, Berkeley School of Public Health. Mark’s research concerns targeted statistical learning, adaptive designs, causal inference, and their applications, with a focus on HIV, clinical trials, and safety analysis.

This post is one of a ten-part series in which we ask researchers and experts to discuss transparency in empirical social science research across disciplines. It was initially published on the CEGA blog on March 20, 2013.