Wikipedia defines P-Hacking, also called “data-dredging” or “data fishing”, as the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect.
“This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.
“The process of data dredging involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching – perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.
“Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the significance. When large numbers of tests are performed, some produce false results of this type, hence 5% of randomly chosen hypotheses turn out to be significant at the 5% level, 1% turn out to be significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be statistically significant but misleading, since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results.”
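The claim that about 5% of tests on pure noise come out significant at the 5% level can be checked with a small simulation. The sketch below is illustrative only and is not taken from the sources cited on this page: the function names, the sample sizes, the random seed, and the choice of a two-sided z-test with known standard deviation are all assumptions made to keep the example within the Python standard library. It also computes 1 − (1 − α)^m, the chance of at least one false positive among m independent tests when no real effect exists.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(sample, sigma=1.0):
    """p-value of a z-test of H0: mean == 0, assuming a known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def fraction_significant(n_tests, n_obs, alpha, rng):
    """Run n_tests tests on pure noise and return the share with p < alpha."""
    hits = 0
    for _ in range(n_tests):
        # Every sample is drawn with true mean 0, so any "finding" is spurious.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / n_tests


def prob_at_least_one(n_tests, alpha):
    """Chance that at least one of n_tests independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
share = fraction_significant(n_tests=5000, n_obs=30, alpha=0.05, rng=rng)
print(f"share of noise-only tests with p < 0.05: {share:.3f}")

for m in (1, 10, 20, 100):
    print(f"{m:>3} tests: P(at least one false positive) = {prob_at_least_one(m, 0.05):.3f}")
```

Under these assumptions the simulated share should sit close to 0.05, while the chance of at least one spurious result rises quickly with the number of tests. This is why reporting only the significant tests is misleading.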
Writing in Freakonometrics, Arthur Charpentier (reference below) lists six techniques, which he considers cheating, that can be used to get a “good” model by targeting the p-value:
- Stop collecting data when p ≤ 0.05
- Analyze many measures, but report only those where p ≤ 0.05
- Collect and analyze many conditions, but only report those with p ≤ 0.05
- Use covariates to get p ≤ 0.05
- Exclude participants to get p ≤ 0.05
- Transform the data to get p ≤ 0.05
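The first technique in the list, stopping data collection once the p-value crosses the threshold, can be illustrated with a simulation on pure noise. This is a minimal sketch under stated assumptions, not Charpentier's own code: the function names, the batch sizes (10 observations to start, then 5 at a time up to 100), the seed, and the z-test with known standard deviation are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def p_from_sum(total, n, sigma=1.0):
    """Two-sided z-test p-value for H0: mean == 0, given the running sum of n draws."""
    z = (total / n) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def study_is_significant(rng, peek, start=10, step=5, max_n=100, alpha=0.05):
    """Simulate one study on pure noise; return True if it ends with p < alpha.

    With peek=True the data are tested after every batch and collection stops
    at the first p < alpha. With peek=False only the final sample is tested.
    """
    total, n = 0.0, 0
    while n < max_n:
        batch = start if n == 0 else step
        for _ in range(batch):
            total += rng.gauss(0.0, 1.0)  # true mean is 0: there is no effect
        n += batch
        if peek and p_from_sum(total, n) < alpha:
            return True  # stop early and report the "significant" result
    return p_from_sum(total, n) < alpha


def false_positive_rate(peek, n_studies=2000, seed=1):
    """Share of simulated noise-only studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, peek) for _ in range(n_studies))
    return hits / n_studies


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"false positive rate, fixed sample size: {fixed_rate:.3f}")
print(f"false positive rate, stop when p < 0.05: {peeking_rate:.3f}")
```

With a fixed sample size the false positive rate should stay near the nominal 5%. When the analyst checks after every batch and stops at the first significant result, the rate comes out well above 5% even though there is no effect in the data.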
Wikipedia, Data Dredging, https://en.wikipedia.org/wiki/Data_dredging, accessed 2 December 2018.
Charpentier, Arthur (2015), P-Hacking, or Cheating on a P-Value, Freakonometrics, at https://freakonometrics.hypotheses.org/19817, accessed 2 December 2018.
Page created by: Alec Wreford and Ian Clark, last modified 2 December 2018.
Image: WikiVisually, Data dredging, at https://wikivisually.com/wiki/Data_dredging, accessed 2 December 2018.