P-Curve: A tool for detecting publication bias

How can we tell when publication bias has led to data mining? In this video, we’ll show you what the distribution of p-values should look like when (1) there are no observable effects of a treatment, (2) there are observable effects, and (3) data mining is likely to have occurred. We also discuss a 2014 article authored by Uri Simonsohn, Leif Nelson, and Joseph Simmons (the same authors of “False-Positive Psychology”) in the Journal of Experimental Psychology that demonstrated widespread data mining in that body of literature.

In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn propose a way to distinguish between truly significant findings and false positives resulting from selective reporting and specification searching, or p-hacking.

P values indicate “how likely one is to observe an outcome at least as extreme as the one observed if the studied effect were nonexistent.” As a reminder, most academic journals will only publish studies with p values less than 0.05, the most common threshold for statistical significance.

Some researchers use p-hacking to “find statistically significant support for nonexistent effects,” allowing them to “get most studies to reveal significant relationships between truly unrelated variables.”

The p-curve can be used to detect p-hacking. The authors define this curve as “the distribution of statistically significant p values for a set of independent findings. Its shape is a diagnostic of the evidential value of that set of findings.”

In order for p-curve inferences to be credible, the p values selected must be:

associated with the hypothesis of interest,
statistically independent from other selected p values, and
distributed uniformly under the null.

It is also important to clarify that the p-curve assesses only reported data and not the theories that they are testing. Similarly, it’s important to keep in mind that if a set of values is found to have evidential value, it doesn’t automatically imply internal or external validity.

Using the p-curve to detect p-hacking is fairly straightforward. If the curve is right-skewed as in the chart to the right in the figure below, there are more low (0.01s) than high (0.04s) significant p values, suggesting truly significant p values. When non-existent effects are studied (i.e., a study’s null hypothesis is true), all p values are equally likely to be observed, thus producing a uniform curve or a straight line. In the figure below, each chart incorporates a uniform curve that is dotted and red for comparison. Curves that are left-skewed however, as in the chart to the left in the figure below, indicate more high p values than low ones; p-hacking has likely occurred.

“P-curves for Journal of Personality and Social Psychology” (Click to expand)

The above figure displays the results of the authors’ demonstration of the p-curve through the analysis of two sets of findings taken from the Journal of Personality and Social Psychology (JPSP). They hypothesized that one set was p-hacked, while the other was not. In the set in which they suspected p-hacking, they realized that the authors of the publication reported results only with a covariate. While there is nothing wrong with including covariates in study’s design, many researchers will include one only after their initial analysis (without the covariate) was found to be non-significant.

Simmons, Nelson, and Simonsohn provide guidelines to follow when selecting studies to analyze with the p-curve:

Create a selection rule. Authors should decide in advance which studies to use.
Disclose the selection rule.
Maintain robustness to resolutions of ambiguity If it is unclear whether or not a study should be included, authors should report results both with and without that study. This allows readers to see the extent of the influence of these ambiguous cases.
Replicate single-article p-curves. Because of the risk of cherry-picking single articles, the authors suggest a direct replication of at least one of the studies in the article to improve the credibility of the p-curve.

In addition to these guidelines, Simmons, Nelson, and Simonsohn also provide five steps to ensure that the selected p-values meet the three selection criteria we mentioned earlier:

Identify researchers’ stated hypothesis and study design.
Identify the statistical result testing the stated hypothesis.
Report the statistical results of interest.
Recompute precise p-values based on reported test statistics. – This has been made easy through an online app, which you can find at http://p-curve.com/.
Report robustness results of the p-values to your selection rules.

As with anything, the p-curve is not one hundred percent accurate one hundred percent of the time. The validity of the judgments made from a p-curve may depend on “the number of studies being p-curved, their statistical power, and the intensity of p-hacking”. There isn’t much concern over cherry-picking p-curves to ensure the result of a lack of evidential value. However, such a practice can be prevented simply with the disclosure of selections, ambiguity, sample size, and other study details.

Additionally, there are a few limitations with the p-curve. First, it “does not yet technically apply to studies analyzed using discrete test statistics” and is “less likely to conclude data have evidential value when a covariate correlates with the independent variable of interest.” It also has a hard time detecting confounding variables; if there is a real effect, but also mild p-hacking, it usually won’t detect the latter.

Simmons, Nelson, and Simonsohn conclude that, with the examination of a distribution of p-values, one will be able to identify whether selective reporting was used or not. What do you think about the p-curve? Would you use this tool?

We’ll use http://p-curve.com/ in an Exercise later in this Activity.

You can read the entire paper here.

Reference

Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–47. doi:10.1037/a0033242.

Previous Lesson

Back to Week 2

Next Lesson