In their book The Ethical Algorithm (reference below), Michael Kearns and Aaron Roth define p-hacking as repeatedly performing the same experiment, or running many different statistical tests on the same dataset, and then reporting only the most interesting results.
They note (pages 144-145) that:
“It is a technique that scientists can use (deliberately or unconsciously) to try to get their results to appear more significant (remember … that p-values are a commonly used measure of statistical significance). It isn’t a statistically valid practice, but it is incentivized by the structure of modern scientific publishing. This is because not all scientific journals are created equal: like most other things in life, some are viewed as conferring a higher degree of status than others, and researchers want to publish in these better journals. Because they are more prestigious, papers in these journals will reflect better on the researcher when it comes time to get a job or to be promoted. At the same time, the prestigious journals want to maintain their high status: they want to publish the papers that contain the most interesting and surprising results and that will accumulate the most citations. These tend not to be negative results. If goji berries improve your marathon time, the world will want to know! If goji berries have no effect on your athletic performance, there won’t be any headlines written about it.
“This is a game – in the game-theoretic … and in its equilibrium the prestigious journals become highly selective, rejecting most papers. In this game, researchers have a huge incentive to find results that appear to be statistically significant. At the same time, researchers don’t bother investing time and effort into projects that they don’t believe have a shot at appearing in one of the prestigious journals. And negative results – for example, reports of treatments that didn’t work – are not the sorts of things that will get published in high-prestige venues. The result is that even in the absence of explicit p-hacking by individual researchers or teams, published papers represent an extremely skewed subset of the research that has been performed in aggregate. We see reports of surprising findings that defy common sense – but we don’t see reports of experiments that went exactly the way you would have guessed. This makes it hard to judge whether a reported finding is due to a real discovery or just dumb luck.
“And note that this effect doesn’t require any bad behavior on the part of the individual scientists, who might all be following proper statistical hygiene. We don’t need one scientist running a thousand experiments and only misleadingly reporting the results from one of them, because the same thing happens if a thousand scientists each run only one experiment (each in good faith), but only the one with the most surprising result ends up being published.”
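The thousand-scientists point can be illustrated with a small simulation (not from the book; a minimal sketch assuming a simple two-sided z-test with known variance, run under a true null effect). Each of 1,000 "scientists" honestly runs one experiment on noise; publication bias then selects the single smallest p-value:

```python
import math
import random

random.seed(0)

def z_test_p(sample, mu0=0.0):
    """Two-sided z-test p-value, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n)
    # Normal CDF via erf; p = 2 * P(Z > |z|)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 1,000 scientists each run one honest experiment where the true effect is zero.
p_values = []
for _ in range(1000):
    sample = [random.gauss(0, 1) for _ in range(30)]
    p_values.append(z_test_p(sample))

# Only the most "surprising" result gets published.
published = min(p_values)
frac_significant = sum(p < 0.05 for p in p_values) / 1000

print(f"smallest p across 1000 null experiments: {published:.5f}")
print(f"fraction below 0.05 (as expected, about 5%): {frac_significant:.3f}")
```

Roughly 5% of the individual experiments cross p < 0.05 by chance alone, exactly as the test promises, yet the one result the journals see is far below 0.05 and looks like a striking discovery.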
Writing in Freakonometrics, Arthur Charpentier (reference below) lists six techniques, which he considers cheating, that can be used to obtain a “good” model by targeting the p-value:
- Stop collecting data once p < 0.05
- Analyze many measures, but report only those with p < 0.05
- Collect and analyze many conditions, but report only those with p < 0.05
- Use covariates to get p < 0.05
- Exclude participants to get p < 0.05
- Transform the data to get p < 0.05
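The first technique, optional stopping, can be demonstrated directly (a sketch, not from Charpentier's post, again assuming a simple z-test with known variance under a true null effect). Peeking at the p-value after every new observation and stopping as soon as it dips below 0.05 inflates the false-positive rate well above the nominal 5%:

```python
import math
import random

random.seed(1)

def z_test_p(xs):
    """Two-sided z-test p-value for mean 0, assuming known standard deviation 1."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def honest_experiment(n=100):
    """Fix the sample size in advance and test once."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    return z_test_p(xs) < 0.05

def optional_stopping(max_n=100, start_n=10):
    """Check p after every observation; stop collecting as soon as p < 0.05."""
    xs = [random.gauss(0, 1) for _ in range(start_n)]
    while len(xs) < max_n:
        if z_test_p(xs) < 0.05:
            return True  # "significant" result found; stop and report it
        xs.append(random.gauss(0, 1))
    return z_test_p(xs) < 0.05

trials = 2000
honest_rate = sum(honest_experiment() for _ in range(trials)) / trials
hacked_rate = sum(optional_stopping() for _ in range(trials)) / trials

print(f"honest false-positive rate:            {honest_rate:.3f}")
print(f"optional-stopping false-positive rate: {hacked_rate:.3f}")
```

The honest procedure rejects a true null about 5% of the time; the stop-when-significant procedure rejects it several times as often, even though every individual test is computed correctly.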
Kearns, Michael, and Aaron Roth (2019), The Ethical Algorithm: The Science of Socially Aware Algorithm Design, New York: Oxford University Press.
Charpentier, Arthur (2015), P-Hacking, or Cheating on a P-Value, Freakonometrics, at https://freakonometrics.hypotheses.org/19817, accessed 2 December 2018.
Page created by: Alec Wreford and Ian Clark, last modified 3 March 2021.
Image: WikiVisually, Data dredging, at https://wikivisually.com/wiki/Data_dredging, accessed 2 December 2018.