By David Laitin (Political Science, Stanford)
My claim in this blog entry is that political science will remain principally an observation-based discipline and that our core principles of establishing findings as significant should consequently be based upon best practices in observational research. This is not to deny that there is an expanding branch of experimental studies which may demand a different set of principles; but those principles add little to confidence in observational work. As I have argued elsewhere (“Fisheries Management” in Political Analysis 2012), our model for best practices is closer to the standards of epidemiology than to those of drug trials. Here, through a review of the research program of Michael Marmot (The Status Syndrome, New York: Owl Books, 2004), I evoke the methodological affinity of political science and epidemiology, and suggest the implications of this affinity for evolving principles of transparency in the social sciences.
Two factors drive political science into the observational mode. First, like the Centers for Disease Control receiving an emergency call that describes an outbreak of some hideous virus in a remote corner of the world, political scientists see it as core to their domain to account for anomalous outbreaks (e.g. that of democracy in the early 1990s) wherever they occur. Not unlike epidemiologists seeking to model the hazard of SARS or AIDS, political scientists cannot randomly assign secular authoritarian governments to some countries and orthodox authoritarian governments to others to get an estimate of the hazard rate of transition to democracy. Rather, they merge datasets looking for patterns, theorize about them, and then put the implications of the theory to the test with other observational data. Accounting for outcomes in the real world drives political scientists into the observational mode.
Second, political scientists retain an interest in group outcomes that transcend individual differences. As Marmot points out, “The reason why one San Franciscan gets heart disease and not another may be different from the explanation of why the heart-disease rate is higher in San Francisco than it is in Tokyo. Within medicine, the primary concern is with individual differences…Public health is more likely to be concerned with the health of groups” (p. 31). Consider this analogy in political science. Excellent field-experimental work in the war zones of Iraq and Afghanistan can show – focusing on the rules of engagement (ROE) and the size of development funding for any project, both of which can be randomly assigned across villages – the micro conditions that reduce insurgent killings of villagers. But would optimum ROE or development assistance in these places have changed the war outcomes? To answer that question, political analysts would need to examine the conditions under which COIN doctrine can defeat an insurgency, most likely by looking at observational data on the GDP of the country experiencing the insurgency, the quality of the terrain, the commitment of the counter-insurgency force, etc., factors that are naturally controlled for in any experimental design. Similarly, factors that explain the marginal effects of GOTV treatments are not likely to be major factors in explaining long-term secular trends in voter turnout.
It could be argued, as demonstrated in Alan Gerber and Donald Green’s Field Experiments (New York: Norton, 2012), that the experimental world is increasingly addressing consequential outcome variables in the real world. I agree. However, as political scientists aggregate up from individual field experiments in order to assure some level of generality, they will necessarily look for (observational) correlates to account for the wide variety of outcomes obtained from the same protocols across experimental fields. In the world of politics, we cannot close our eyes to what is in front of us.
Back to public health, Marmot’s research program is a fishing expedition that has likely reeled in a whale. Beginning in 1967, he collected health data (known as the Whitehall study) on thousands of British civil servants. The stunning finding was that at each level of status within this bureaucratic hierarchy, life expectancy increased. This could not be explained by access to health care or salary levels. There were also controls for factors that are associated with individual propensities to health failure, such as smoking or diet. And he was also able to show (through the use of height as an instrument) that pre-career health features could not account for both success in the bureaucracy and long life. Marmot explained this outcome by focusing on autonomy (which upper grades had in abundance, even though they had more stress on the job), and linking autonomy to better health. Data from experiments around the world – e.g. comparing Japanese and American auto workers, with the former having more autonomy on the job and also lower rates of absence due to sickness – and from the animal world (comparing hierarchical vs. non-hierarchical species and relative health) were mostly consistent with the autonomy conjecture. One great experiment, relying on a discontinuity design, revealed longer lives for Oscar winners than for actors who were nominated but did not win. (Alas for the theory, there was no difference between also-rans and winners among script writers.) Although the mechanism linking job autonomy to health was hardly nailed down in this popularized version of Marmot’s research program, in the British Medical Journal (BMJ) (Bosma, Hans, Michael G. Marmot, Harry Hemingway, Amanda C. Nicholson, Eric Brunner, Stephen A. Stansfeld, “Low job control and risk of coronary heart disease in Whitehall II (prospective cohort) study” Vol. 314, No. 7080, Feb. 22, 1997, pp. 558-565), his team presents evidence linking higher fibrinogen concentrations – associated with susceptibility to a coronary event – and low job control.
Readers of the published papers from this research team in the scientific journals might fear that the team has indeed been on an extended fishing expedition and that their whale has not yet been fully reeled in. In one paper (De Vogli, Roberto, Jane E Ferrie, Tarani Chandola, Mika Kivimäki, Michael G Marmot (2007) “Evidence Based Public Health Policy and Practice, Unfairness and health: evidence from the Whitehall II Study” J Epidemiol Community Health 61: 513-518) – and the authors are reasonably up-front about the inferential problems delimiting confidence in their results – the authors show a statistically significant relationship between respondents within the British civil service who self-report that they are treated unfairly and coronary events. For one, as Martin White pointed out in his Lancet commentary on the BMJ study (350 (9088): 1404 – 1405, 8 November 1997), standard statistical issues were not properly addressed: the favored independent variable was but one component of SES with no attempt to determine whether it was a proxy for SES or it stood independently of SES; the p-values bordered on non-significance; and the overall model was not very robust to alternative specifications. But this raises a problem more closely related to the concern of this blog, viz. that the team rejected at least one suspected “cause” of coronary events as non-significant, namely hostility in the workplace. Shouldn’t they then have lowered, for each right-hand-side variable tested, the p-value threshold necessary to establish significance (in compliance with the Bonferroni correction)?
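As a rough illustration of the arithmetic behind that question, here is a minimal Python sketch of the Bonferroni correction: with m variables tested at a family-wise level alpha, each individual test must clear alpha / m. The function names and the p-values below are hypothetical, invented for illustration only; they are not taken from the Whitehall publications.

```python
from typing import Dict


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def survives_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Flag which tested variables stay significant once every test is counted."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for three right-hand-side variables (made up, not real results).
example_p_values = {"variable_a": 0.03, "variable_b": 0.40, "variable_c": 0.004}
results = survives_bonferroni(example_p_values)
print(results)
```

In this made-up case, a variable with p = 0.03 would pass the conventional 0.05 bar but not the corrected bar of roughly 0.017, which is why the count of conjectures tested, including the discarded ones, matters.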
This reading in public health suggests several implications for political science. First, the x (social status) and the y (health) look very much like a standard political science problem, much more so than tests for the effectiveness of a new drug. Second, fishing (which the authors obviously engaged in) yielded a pattern that is plausible, fresh, and of immense importance for public policy. Third, holding themselves to a pre-analysis plan would have been too constraining, given the scores of variables coded for each individual respondent and that there were no theoretically guided conjectures going into the study. To be sure, no one is suggesting that regressions that do not reflect commitments within pre-analysis plans be excluded from consideration for publication; but anticipation that pre-analysis plans are the new standard will dampen the enthusiasm, especially among young researchers, for persisting in exploring preliminary results reached at the discovery stage of their research. Fourth, there are a host of procedures that must pass muster for the Whitehall results to be considered probably valid. To their credit, the authors (and their wider research community) have conducted many of these, including more out-of-sample replications than political scientists typically perform and tests of the status gradient in societies less enamored with status than the UK. While these replications are in some way “tests” of the Whitehall discoveries – the distinction between discovery and testing is not clear cut – they all require attention to covariates and functional form that are themselves subject to fishing incentives. However, as these replications continue to add up, and in the direction heralded in the Whitehall publications, we gain confidence in the relationship between status and health.
But the Whitehall investigators haven’t done the optimum: it evidently took a long time (and pressure from trends in the scientific community toward open access) for their Whitehall data to become available for replication, and even now data-sharing agreements are arbitrary and ad hoc; there are no reports on how many conjectures were found to be insignificant and thus no way to estimate a Bonferroni correction; there is no principled attempt to uncover the mechanism linking autonomy or unfair treatment on the job to health events, and thus causal claims – of which there are all too many in the book – appear suspect. For the purposes of this blog, my big point is that for the outcomes that Marmot and collaborators were seeking to account for – i.e. public health outcomes – the statement that job autonomy was in their pre-analysis plan would make the findings a bit more convincing (though it would have been a great loss if such a plan had discouraged them from putting that bait on their hook); however, a more transparent set of procedures on what was tested and scrutiny of the raw data by the research community for purposes of replication would have, in my judgment, a higher expected yield in confidence.
In a now-classic paper, Mathew McCubbins and Thomas Schwartz (“Congressional Oversight Overlooked: Police Patrols Versus Fire Alarms,” (1984), American Journal of Political Science 28: 165–179) evaluated two approaches to oversight in principal/agent relations. The principal could impose an overarching policing system (the way the FDA operates in drug trials; or peer reviewers in the clever way devised by Uri Simonsohn, Leif Nelson and Joseph Simmons in their forthcoming paper “P-curve: A Key to the File Drawer”). Alternatively, the principal could wait for the broader population to complain about (i.e. sound fire alarms over) and expose errors in the agent’s product (through a system of replication). While the principals regulating the research environments in biochemistry, psychology, economics and political science all face similar problems in regulating their agents, the efficient policing mechanism for one may be inefficient for another. For example, the P-curve algorithm is appropriate for a disciplinary practice in which a single laboratory runs the same protocols (with minor deviations) several times. The FDA solution is appropriate where the raw data for experiments are proprietary. My suggestion in this blog is that for the modal works in political science and epidemiology, expanding the rewards given to those who sound fire alarms is the most appropriate avenue for progress in rooting out false positives without constraining discovery.
Indeed, in that the nature of our investigative problems is quite similar, political science has much to learn from the successes and unfulfilled promise of the Whitehall studies, more so than from the bureaucratic police patrols imposed on medical research by the FDA. Optimal transparency procedures, I conclude, depend on the specifics of the research program.