Common Misuses of Statistical Hypothesis Testing: Part 2
How well do big data and statistical hypothesis testing fit together?
Statistical hypothesis testing seems like an easy-to-use and easy-to-understand tool, but it can be deceptive in some ways. Have you ever seen a report in which a statistical test is applied to big data and the result proudly concludes that “there is a significant difference”? In fact, statistical tests on big data with more than a million samples, while mathematically correct, are of little practical use. Let me explain why.
Experiments on the relationship between sample size and statistical significance
I will now run some numerical experiments to illustrate. Before proceeding, I define a helper function that will be used in the following sections. It generates a specified number of samples with a specified mean and standard deviation. Instead of simply generating random samples and returning them directly, it calculates the mean and standard deviation of the generated samples and rescales them to match the arguments. Thus, the mean and standard deviation of the generated samples will exactly equal the arguments.
import numpy as np

def generate_samples(mean, stddev, samplesize):
    # Draw samples from a standard normal distribution; the parameters here
    # do not matter because the samples are rescaled below
    samples = np.random.normal(0, 1, samplesize)
    # Calculate the mean and standard deviation of the generated samples
    sample_mean = np.mean(samples)
    sample_std = np.std(samples)
    # Rescale so the sample mean and standard deviation exactly match the arguments
    samples = (samples - sample_mean) * stddev / sample_std + mean
    return samples
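As a quick sanity check (this snippet is my addition, not part of the original experiment), we can confirm that the rescaling makes the sample statistics match the arguments exactly, not just approximately:

samples = generate_samples(100, 10, 10)
print(np.mean(samples))  # exactly 100.0, up to floating-point rounding
print(np.std(samples))   # exactly 10.0, up to floating-point rounding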
Using this generate_samples function, we create:
- Sample A: 10 samples with mean=100 and standard deviation=10
- Sample B: 10 samples with mean=100.1 and standard deviation=10
We plot the distribution of the samples and perform a t-test to see whether there is a statistically significant difference between Sample A and Sample B.
Here is the experimental code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, gaussian_kde

SAMPLE_SIZE = 10

# Generate SAMPLE_SIZE random samples with mean 100 and standard deviation 10
sampleA = generate_samples(100, 10, SAMPLE_SIZE)
# Generate SAMPLE_SIZE random samples with mean 100.1 and standard deviation 10
sampleB = generate_samples(100.1, 10, SAMPLE_SIZE)

# Run a t-test
t_stat, p_value = ttest_ind(sampleB, sampleA)

# Prepare the graph
plt.figure(figsize=(12, 6))
plt.ylim(-0.005, 0.045)

# Kernel density estimate for sample A
kde1 = gaussian_kde(sampleA)
x1 = np.linspace(60, 140, 1000)
y1 = kde1(x1)
plt.plot(x1, y1, label=f'KDE of sample A (mean=100, n={SAMPLE_SIZE})', c='blue')

# Kernel density estimate for sample B
kde2 = gaussian_kde(sampleB)
x2 = np.linspace(60, 140, 1000)
y2 = kde2(x2)
plt.plot(x2, y2, label=f'KDE of sample B (mean=100.1, n={SAMPLE_SIZE})', c='red')

# Small dots indicating sample A values (with vertical jitter so overlapping points stay visible)
jitter1 = np.random.normal(0, 0.001, len(sampleA))
plt.scatter(sampleA, jitter1, c='blue', alpha=1/(SAMPLE_SIZE**.5), label=f'Plots of sample A (mean=100, n={SAMPLE_SIZE})')

# Small dots indicating sample B values (with vertical jitter)
jitter2 = np.random.normal(0, 0.001, len(sampleB))
plt.scatter(sampleB, jitter2, c='red', alpha=1/(SAMPLE_SIZE**.5), label=f'Plots of sample B (mean=100.1, n={SAMPLE_SIZE})')

plt.title('Distribution of sample A and sample B')
plt.xlabel('Value')
plt.ylabel('Density')
plt.text(60, -0.014, f't-statistic: {t_stat:.4f}', fontsize=12)
plt.text(60, -0.016, f'p-value: {p_value:.4f}', fontsize=12)

# Significance level
alpha = 0.05

# Evaluate the result using the p-value
if p_value < alpha:
    plt.text(60, -0.018, 'There is a statistically significant difference (p < 0.05)', fontsize=12)
else:
    plt.text(60, -0.018, 'No statistically significant difference (p >= 0.05)', fontsize=12)

plt.legend()
plt.grid(True)
plt.show()
Here is an example of the result (Figure 1). The dots near the x-axis show the individual sample values, and the curves are the kernel density estimates of the sample distributions. Jitter is added so that the dots remain somewhat visible even when they overlap.
There is a difference of 0.1 between the means of Sample A and Sample B, but the difference was not statistically significant.
Now, change one line of the experimental code as follows:
SAMPLE_SIZE = 1000
With each sample size set to 1,000, what would the result be? Here are the results (Figure 2).
Even with 1,000 samples each, the 0.1 difference in means between Sample A and Sample B is still not statistically significant.
Let’s try the following:
SAMPLE_SIZE = 100000
What would happen this time? Here are the results (Figure 3).
Yay, we finally got a result with a statistically significant difference!
Should we be pleased? Is this something to celebrate?
Let’s look at Figure 3 again. The blue and red curves are almost indistinguishable. That’s because the underlying distributions are almost identical: one is a normal distribution with mean=100 and standard deviation=10, and the other is a normal distribution with mean=100.1 and standard deviation=10. If a data scientist reports, “I have obtained a statistically significant difference, so let’s choose Sample B as the target of this action,” the manager is likely to reply, “Sample A and Sample B are almost the same (statistical significance aside).”

The important point is that statistical significance tells us nothing about the size of the difference (0.1) between Sample A (mean 100) and Sample B (mean 100.1). In Figures 1, 2, and 3 alike, Sample A had a mean of 100 and a standard deviation of 10, and Sample B had a mean of 100.1 and a standard deviation of 10; the only thing that changed was the sample size. Yet there was no significant difference in Figures 1 and 2, and there was a significant difference in Figure 3.

This means that statistical tests on big data with 100,000 or 1,000,000+ samples will almost always yield a “significant difference,” even when the difference between the groups is negligible in practice. And if a test almost always declares significance regardless of whether the difference matters, there is little point in running it on big data in the first place.
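The article does not compute an effect size, but one way to quantify “how big is the difference, regardless of significance” is a standardized effect size such as Cohen’s d. Here is a minimal sketch (my addition, not part of the original experiment): for these samples, d is about (100.1 − 100) / 10 = 0.01, far below even the conventional “small effect” threshold of 0.2, no matter how large the sample is.

import numpy as np

# My addition: Cohen's d, a standardized effect size, does not grow with sample size
def cohens_d(a, b):
    # Pooled standard deviation for two equal-sized groups
    pooled_std = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled_std

sampleA = generate_samples(100, 10, 100000)
sampleB = generate_samples(100.1, 10, 100000)
print(cohens_d(sampleB, sampleA))  # ~0.01: negligible, even though p < 0.05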
Let’s go over this again to be sure.
Let’s take sample sizes of 10, 100, 1000, 10000, 100000, and 1000000, repeat the test 1,000 times for each size, and output the counts.
from scipy.stats import ttest_ind

# Number of simulations per sample size
SIMULATIONS = 1000

# List of sample sizes
SAMPLE_SIZES = [10, 100, 1000, 10000, 100000, 1000000]
THRESHOLD = 0.05

# Initialize counters
significant_count = {size: 0 for size in SAMPLE_SIZES}
non_significant_count = {size: 0 for size in SAMPLE_SIZES}

# Run the simulation
for sample_size in SAMPLE_SIZES:
    for _ in range(SIMULATIONS):
        sampleA = generate_samples(100, 10, sample_size)
        sampleB = generate_samples(100.1, 10, sample_size)
        t_stat, p_value = ttest_ind(sampleA, sampleB)
        if p_value < THRESHOLD:
            significant_count[sample_size] += 1
        else:
            non_significant_count[sample_size] += 1
    print(f"Sample Size: {sample_size}")
    print(f"Significant Differences: {significant_count[sample_size]}")
    print(f"Non-significant Differences: {non_significant_count[sample_size]}")
    print()
The results are shown below.
Sample Size: 10
Significant Differences: 0
Non-significant Differences: 1000
Sample Size: 100
Significant Differences: 0
Non-significant Differences: 1000
Sample Size: 1000
Significant Differences: 0
Non-significant Differences: 1000
Sample Size: 10000
Significant Differences: 0
Non-significant Differences: 1000
Sample Size: 100000
Significant Differences: 1000
Non-significant Differences: 0
Sample Size: 1000000
Significant Differences: 1000
Non-significant Differences: 0
Up to a sample size of 10,000, none of the 1,000 runs showed a significant difference; from a sample size of 100,000 upward, all 1,000 runs showed a significant difference. The separation is perfectly clean: at a given sample size, every run produces the same verdict, no matter how many times we repeat the experiment. This is because the test statistic of samples produced by the generate_samples function is always the same for a given sample size: the function pins the sample mean and standard deviation exactly, so no randomness survives into the test. What is this “test statistic”?
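We can verify this directly (this check is my addition): because generate_samples fixes the sample mean and standard deviation exactly, two independent draws of the same size always yield the same t-statistic and p-value.

from scipy.stats import ttest_ind

# My addition: the t-statistic depends only on the sample size, because
# generate_samples pins each sample's mean and standard deviation exactly
for _ in range(3):
    sampleA = generate_samples(100, 10, 1000)
    sampleB = generate_samples(100.1, 10, 1000)
    t_stat, p_value = ttest_ind(sampleA, sampleB)
    print(t_stat, p_value)  # prints the same pair all three times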
Common Procedures for Statistical Hypothesis Testing
Let me briefly explain the theoretical aspects of why these results are obtained.
There are various statistical hypothesis testing methods depending on the data of interest: t-tests (one-sample t-test, independent two-sample t-test, paired t-test, Welch’s t-test), the Mann-Whitney U test, the Wilcoxon signed-rank test, the chi-square test, and so on. There are many test methods, but all of them generally follow the same three-step procedure (a minimal sketch follows the list).
- Calculate the “test statistic” from the sample. The “test statistic” is an intermediate product of the statistical hypothesis test.
- Calculate the p-value from the test statistic and a reference statistical distribution. The p-value indicates the probability of observing data at least as extreme as the actual data if there were no real difference.
- Compare the p-value to a significance level such as 0.05 (or 0.01, 0.001) to decide whether there is a statistically significant difference. If the p-value is small, we infer that the observed data is unlikely to be the result of chance.
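To make the three steps concrete, here is a minimal sketch (my illustration, not code from any particular textbook; I use the one-sample t-test because its formula is compact):

import numpy as np
from scipy.stats import t as t_dist

def one_sample_ttest(samples, popmean, alpha=0.05):
    n = len(samples)
    # Step 1: calculate the test statistic from the sample
    t_stat = (np.mean(samples) - popmean) / (np.std(samples, ddof=1) / np.sqrt(n))
    # Step 2: calculate the p-value from the test statistic and the t-distribution
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)
    # Step 3: compare the p-value with the significance level
    return t_stat, p_value, p_value < alpha

The result should agree with scipy.stats.ttest_1samp, which packages steps 1 and 2 for you.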
The formulas for calculating the test statistic and the statistical distributions used differ for each testing method. Explaining them is not the main purpose of this article, so please refer to textbooks or other websites.
In the case of the t-test used in the example, as the number of samples increases, the absolute value of the test statistic increases and the p-value becomes smaller. This relationship, where the p-value shrinks as the sample size grows, is common to all statistical hypothesis testing methods. This is why statistical tests on large samples so often lead to the conclusion “statistically significant difference,” even when the actual difference between the samples is tiny.
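We can see this concretely with the numbers from the experiments above (this derivation is my addition). For two equal-sized groups whose means differ by 0.1 with a standard deviation of 10, the two-sample t-statistic is roughly 0.1 / (10 * sqrt(2/n)), so it grows like the square root of n while the underlying difference stays fixed:

import numpy as np
from scipy.stats import t as t_dist

# My addition: for a fixed mean difference of 0.1 and standard deviation of 10,
# the two-sample t-statistic grows like sqrt(n), so the p-value shrinks with n
for n in [10, 100, 1000, 10000, 100000, 1000000]:
    t_stat = 0.1 / (10 * np.sqrt(2 / n))
    p_value = 2 * t_dist.sf(t_stat, df=2 * n - 2)
    print(f"n={n:>7}: t={t_stat:.3f}, p={p_value:.4f}")

The p-value crosses 0.05 between n=10,000 (p ≈ 0.48) and n=100,000 (p ≈ 0.025), which is exactly the clean cutoff we observed in the simulation.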
This is how the author thinks about the relationship between the number of samples and what a statistical hypothesis test can meaningfully tell us.
The final judgment is left to users
Even if you run a test on a large sample, such as 100,000 or 1,000,000 observations, a statistical library such as scipy will not show a warning like “the sample size is too large.” That is because the test itself is mathematically feasible. Whether it is meaningful is a separate question: the user must understand and judge what the test can and cannot say.
In the past, experiments and surveys were costly. Statistical hypothesis testing was developed and refined to draw mathematically sound conclusions from limited samples. The proponents of the various testing methods probably did not envision today’s big data era. To make good use of statistical hypothesis testing, an asset built in the past, in the current big data era, we data scientists, as the users of that asset, need to be careful.
Summary
- Statistical hypothesis testing on big data, such as 1M samples, makes little sense: with that many samples, even a negligible difference will come out as “statistically significant.”
Related articles
Part 1 is here, and Part 3 is here.