Common Misuses of Statistical Hypothesis Testing: Part 3
Does a smaller p-value indicate greater importance?
In the previous two articles, I discussed misuses of statistical hypothesis testing that we should be aware of. Here, I will explain one last common misuse. The p-value should only be compared to a predetermined significance level, such as 0.05 or 0.01. Have you ever seen a report that compares the p-values of different tests to each other and draws conclusions from that comparison? In fact, it does not make sense to compare the p-values of different tests. Let me explain why with an example.
Preparation for the experiment
Once again, I will run a numerical experiment and explain its results. First, as before, I define a function that generates a specified number of samples with a specified mean and standard deviation. Instead of simply returning raw random samples, the function rescales them so that they exactly match the mean and standard deviation given as arguments.
def generate_samples(mean=100, stddev=10, samplesize=1000):
    import numpy as np
    # Generate samples that follow a normal distribution
    samples = np.random.normal(1, 1, samplesize)
    # Calculate the mean and standard deviation of the generated sample
    sample_mean = np.mean(samples)
    sample_std = np.std(samples)
    # Match the mean and standard deviation of the generated sample to the argument
    samples = (samples - sample_mean) * stddev / sample_std + mean
    return samples
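As a quick sanity check (this snippet is not part of the original experiment), you can confirm that the returned samples match the requested mean and standard deviation exactly:
import numpy as np
samples = generate_samples(mean=100, stddev=10, samplesize=1000)
print(np.mean(samples))  # 100.0 (up to floating-point error)
print(np.std(samples))   # 10.0 (np.std uses ddof=0, matching the rescaling inside the function)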
Below is a function that performs a statistical test using the function above. The code differs from that of Part 2 because the experiment is set up differently. Given the mean, standard deviation, and number of samples for samples A and B as arguments, it plots their distributions and outputs the result of a t-test.
def experiment(meanA, stddevA, samplesizeA, meanB, stddevB, samplesizeB):
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import ttest_ind, gaussian_kde
    # Generate random samples with the given means and standard deviations
    sampleA = generate_samples(meanA, stddevA, samplesizeA)
    sampleB = generate_samples(meanB, stddevB, samplesizeB)
    # Prepare the graph
    plt.figure(figsize=(12, 6))
    plt.xlim(60, 150)
    plt.ylim(-0.005, 0.045)
    # Kernel density estimation for sample A
    kde1 = gaussian_kde(sampleA)
    x1 = np.linspace(60, 150, 1000)
    y1 = kde1(x1)
    plt.plot(x1, y1, label=f'KDE of sample A (mean={meanA}, stddev={stddevA}, n={samplesizeA})', c='blue')
    # Kernel density estimation for sample B
    kde2 = gaussian_kde(sampleB)
    x2 = np.linspace(60, 150, 1000)
    y2 = kde2(x2)
    plt.plot(x2, y2, label=f'KDE of sample B (mean={meanB}, stddev={stddevB}, n={samplesizeB})', c='red')
    # Small dots indicating sample A (with jitter)
    jitter1 = np.random.normal(0, 0.001, len(sampleA))
    plt.scatter(sampleA, jitter1, c='blue', alpha=1/(samplesizeA**.5), label='Plots of sample A')
    # Small dots indicating sample B (with jitter)
    jitter2 = np.random.normal(0, 0.001, len(sampleB))
    plt.scatter(sampleB, jitter2, c='red', alpha=1/(samplesizeB**.5), label='Plots of sample B')
    plt.title('Distribution of sample A and sample B')
    plt.xlabel('Value')
    plt.ylabel('Density')
    # Run a t-test
    t_stat, p_value = ttest_ind(sampleB, sampleA)
    plt.text(60, -0.014, f't-statistic: {t_stat:.4f}', fontsize=12)
    plt.text(60, -0.016, f'p-value: {p_value:.4f}', fontsize=12)
    # Set the significance level
    alpha = 0.05
    # Evaluate the result using the p-value
    if p_value < alpha:
        plt.text(60, -0.018, 'There is a statistically significant difference (p < 0.05)', fontsize=12)
    else:
        plt.text(60, -0.018, 'No statistically significant difference (p >= 0.05)', fontsize=12)
    plt.legend()
    plt.grid(True)
    plt.show()
Experiments
Let’s try the first experiment with this code.
- Sample A: Mean=100, Standard deviation=10, Number of samples=10
- Sample B: Mean=110, Standard deviation=10, Number of samples=10
Let’s check the difference between the two groups. You can perform the experiment with the following code.
experiment(100, 10, 10, 110, 10, 10)
Here is an example of the results.
P-value = 0.0480, with a significant difference. Phew!
Next, let’s change the conditions a little and conduct a second experiment.
- Sample A: Mean=100, Standard deviation=10, Number of samples=100
- Sample B: Mean=103, Standard deviation=10, Number of samples=100
What about the difference between the two groups? The conditions that differ from the first experiment are the mean of sample B (103 instead of 110) and the number of samples in each group (100 instead of 10).
You can perform the experiment with the following code.
experiment(100, 10, 100, 103, 10, 100)
Here is an example of the results.
P-value = 0.0361 with a significant difference. Yoo-hoo!
We are in good shape. Now let’s move on to the third experiment.
- Sample A: Mean=100, Standard deviation=10, Number of samples=1000
- Sample B: Mean=101, Standard deviation=10, Number of samples=1000
What about the difference between the two groups? The conditions that differ from the second experiment are the mean of sample B (101 instead of 103) and the number of samples in each group (1,000 instead of 100).
You can perform the experiment with the following code.
experiment(100, 10, 1000, 101, 10, 1000)
Here is an example of the results.
P-value = 0.0255, with a significant difference. Bravo!
Let’s summarize the main points of these three experiments.
If the p-value is less than the significance level (0.05 in this case), you conclude that a “statistically significant difference” was obtained; if the p-value is greater than or equal to the significance level, you conclude that it was not. So, is it correct to say that experiment 2, with a p-value of 0.0361, has a “more statistically significant” result than experiment 1, with a p-value of 0.0480? Does experiment 3, with a p-value of 0.0255, have an even “more statistically significant difference” than experiment 2? On the other hand, the difference between the means of the two groups in experiment 1 (=10) is larger than in experiment 2 (=3), and the difference in experiment 2 (=3) is larger than in experiment 3 (=1). What exactly is a “statistically significant difference”?
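To see where these p-values come from, here is a minimal sketch (not part of the original code) that reconstructs the two-sample t-statistics directly from the summary values of the three experiments. Because generate_samples fixes each sample’s mean and standard deviation exactly, the test result is fully determined by those summary values; the only adjustment is that ttest_ind works with the sample standard deviation (ddof=1).
import numpy as np
from scipy import stats

# Mean difference, population standard deviation, and per-group sample size
# for experiments 1, 2, and 3.
for diff, stddev, n in [(10, 10, 10), (3, 10, 100), (1, 10, 1000)]:
    s = stddev * np.sqrt(n / (n - 1))          # sample standard deviation (ddof=1)
    t = diff / (s * np.sqrt(1 / n + 1 / n))    # pooled two-sample t (equal n, equal variance)
    p = 2 * stats.t.sf(abs(t), df=2 * n - 2)   # two-tailed p-value
    print(f'diff={diff:>2}, n={n:>4}: t={t:.4f}, p={p:.4f}')
This reproduces the p-values reported above (0.0480, 0.0361, 0.0255) and makes the mechanism visible: the t-statistic combines the mean difference with the sample size, so a 1-point difference measured on 1,000 samples can yield a smaller p-value than a 10-point difference measured on 10 samples.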
What does the statistical hypothesis test confirm?
In statistical hypothesis testing, the reasoning goes: “sampling surveys are inevitably biased, but extreme bias rarely occurs many times in a row, so if we observe something that would rarely happen by chance, we should assume that a difference actually exists rather than attribute it to sampling bias.” In this reasoning, a “statistically significant difference” is an indicator of how rarely the observed result would occur by chance. Although it is called a “significant difference,” it does not tell us whether the difference is important; it only tells us whether the result is unlikely under the null hypothesis, taking the difference in means, the standard deviation, and the number of samples together. A “statistically significant difference” does not imply importance, meaningfulness, or severity.
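To make this concrete, here is a small sketch (again, not from the original code) that keeps the 1-point difference of experiment 3 fixed and only increases the number of samples. The difference itself never changes, yet the p-value keeps shrinking:
from scipy.stats import ttest_ind

# The same 1-point difference, tested with larger and larger samples.
for n in [100, 500, 1000, 5000, 10000]:
    sampleA = generate_samples(100, 10, n)
    sampleB = generate_samples(101, 10, n)
    _, p_value = ttest_ind(sampleB, sampleA)
    print(f'n={n:>5}: p-value = {p_value:.4f}')
With enough samples, even a trivially small difference becomes “statistically significant”: the p-value reflects how surprising the data are under the null hypothesis, not how large or important the difference is.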
Summary
- Statistical significance does not mean that a difference is important, meaningful, or substantial
- A p-value should only be compared with the significance level. It does not make sense to compare the p-values of multiple tests with each other
Related articles
Part 1 is here. And Part 2 is here.
At the end of a three-part series
I have shown three examples of how statistical hypothesis testing can be misleading. Such erroneous analysis may be done deliberately, to gain credibility by claiming a statistically significant difference or by pointing to the use of statistical methods, or unintentionally, through a lack of understanding.
While there are many things that should not be done, or that do not make sense to do, in statistical hypothesis testing, we rarely see articles explaining why. In these three articles, I aimed to give concrete examples so that you can grasp intuitively why these practices should be avoided or are meaningless. Once you understand this, you will be better prepared for more advanced topics such as effect size and the power of a test. If you are interested, try searching for the keywords “effect size” and “statistical power”; many good articles about them already exist.
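As a small taste of effect size (a hypothetical addition, not from the original series), here is how Cohen’s d, one common effect-size measure, looks for the three experiments above. With equal standard deviations of 10 in both groups, the pooled standard deviation is simply 10:
# Cohen's d = (mean difference) / (pooled standard deviation)
for diff, pooled_std in [(10, 10), (3, 10), (1, 10)]:
    d = diff / pooled_std
    print(f"mean difference = {diff:>2}: Cohen's d = {d:.1f}")
By a common rule of thumb (0.2 small, 0.5 medium, 0.8 large), experiment 1 shows a large effect (d = 1.0) while experiment 3 shows a very small one (d = 0.1), even though experiment 3 had the smallest p-value. That contrast is exactly why effect size is worth learning alongside hypothesis testing.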