p-Hacking: A Demonstration

This note is a simple demonstration of p-hacking, which can be broadly described as the practice of reaching p <= 0.05 by leveraging all the degrees of freedom researchers enjoy. As an aside, not everyone likes the name p-hacking!

For this demonstration, I am going to show you how not deciding sample size ex ante and continuing data collection until one obtains statistically significant results can be wrong. Consider an experimental researcher who starts with a sample size (more accurately cell size) of 30. However, they don’t find statistically significant results in the first experiment. So they increase the sample size to 60 in the next study and get a p < 0.05.

That’s awesome, isn’t it? The problem with this apprach is that there is no scientific basis for picking the sample size. Why did the researcher stop at 60? What if they had increased the sample size to 90? Would they get statistically significant results? In other words, if we get a statistically significant result at a specific sample size does that guarantee that a higher sample size will also lead to statistically significant result?

As I show below the answer to the last question is no. For this example, I am going to create a vector of 501 values with the following data generating process (dgp):

\[x_i \sim N(\mu = 1, \sigma = 9) \]

A few of my friends pointed out that the standard error is really large compared to the mean. I agree but I am using extreme values to make my point more salient.

Next, I perform a series of t-tests on this vector to test the null hypothesis that the population mean is zero. Note that our dgp tells us that the population mean is 1 so the t-test should reject the null hypothesis.

I first use a sample of first two values from the vector, perform a t-test, and record the p-value. I keep on increasing the sample size by 1 unit until I exhaust the vector. This results in a drawing of 500 samples from the vector and corresponding 500 p-values.

The R code that generated these numbers is as follows:

# Set the random number seed for replicability

# Create a vector of 501 values
r1 = rnorm(501,1,9)

# Create an empty data frame before the loop
d31 = data.frame(sample = NA, p_val = NA)

# Write a loop to iteratively draw samples of increasing sample size, perform t-test, and get the p-value
for (i in 2:501) {
  d31[i - 1,2] = round(t.test(r1[1:i],mu = 0)$p.value,3)
  d31[i - 1,1] = i

# Note the p-values are rounded to 3rd decimal place for better visualization.

I used highcharter package in R to make a nice graph:

hc1 = hchart(d31, "line", lineWidth = 1,hcaes(x = sample, y = p_val)) %>%
  hc_xAxis(title = list(text = "Number of Observations in the Sample")) %>%
  hc_yAxis(title = list(text = "p-value"),
           plotLines = list(
             list(label = list(text = "p-value = 0.05",
                               align = "right", x = -10),
                  color = "#FF0000",
                  width = 1,
                  value = 0.05))) %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_colors(cols) %>%
  hc_exporting(enabled = TRUE)
frameWidget(hc1, height = '400')

If you hover your mouse on the series, you will be able to read the exact points.

In this case we obtain desired 5% statistical significance only at a sample size of 163. After that point, the p-value goes down. However, it goes back up when the sample size is 185! p-value is 0.051 even at a sample size of 288. The takeaway from this exercise is that p-value is not monotonically decreasing as the sample size increases. Thus, when researchers keep collecting data until they reach statistical significance, it might just be a spurious result. If they didn’t stop and instead collected more data, they might have hit statistical non significance as in the above example.

Below, I generate similar graphs for 6 random number seeds to demonstrate that p-values don’t always decrease monotonically with increasing sample size.