Common Misuses of Statistical Hypothesis Testing: Part 1
Does that statistical test make sense?
To everyone who receives reports from data scientists: when a report says “There is a significant difference,” do you assume that a meaningful conclusion has been reached? Is it truly meaningful?
What Statistical Hypothesis Tests Compare
Statistical hypothesis testing can look like a simple tool that is easy to apply, but its mathematical rigor makes the underlying theory hard to understand precisely. If you treat it merely as a tool, there are several patterns of misuse that are easy to fall into. Even if you skip the mathematics, there are at least a few mistakes you should take care to avoid.
For example, suppose that the data sample can be divided into two groups, Group A and Group B, as shown in Figure 1. In this case, it may seem that a statistical test can be performed. However, the test might be meaningless.
What does a statistical test confirm? Is it a method to check whether there is a difference between the Group A samples and the Group B samples, as shown in Figure 2?
No, it is not. A statistical test is not a means to confirm whether or not there is a difference between the samples of Group A and Group B (Figure 2). You might see it explained this way for simplicity, but that explanation sacrifices statistical accuracy.
So, what does a statistical test confirm? Statistical tests are not a method to confirm “the difference between Group A and Group B samples,” but rather a way to confirm “the difference between the populations from which the samples are drawn.”
In statistics, the population is the entire group to which the subjects of a survey or analysis belong. Due to economic, time, or physical constraints, it is often impossible to survey or observe the whole population. In such cases, a portion of the population is sampled at random, and statistical hypothesis testing is used to infer differences between populations from the characteristics of the sampled subjects.
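To make the distinction between samples and populations concrete, here is a minimal simulation sketch. The population size, the distribution, and the sample size are hypothetical values chosen only for illustration. It draws two random samples from one and the same population and shows that the sample means still differ, which is why a difference between samples alone does not tell you that the populations differ.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 values from a single distribution.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Two random samples drawn from the SAME population.
sample_a = random.sample(population, 200)
sample_b = random.sample(population, 200)

mean_a = statistics.mean(sample_a)
mean_b = statistics.mean(sample_b)

# The two sample means differ because of sampling variation alone;
# the underlying population is identical for both groups.
print(f"mean of sample A: {mean_a:.2f}")
print(f"mean of sample B: {mean_b:.2f}")
print(f"difference:       {mean_a - mean_b:.2f}")
```

A hypothesis test asks whether an observed difference is larger than what this kind of sampling variation would plausibly produce.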
Since this is a rather abstract, mathematical way of putting it, I will use concrete examples to illustrate the point.
Example of A/B test
- Example 1 :
A shopping site had two designs for its membership enrollment screen and conducted an A/B test to see which design was preferred. At the end of the test period, the conversion rates for Screen A and Screen B were determined. Is it correct to perform a statistical hypothesis test on the conversion rates of Screen A and Screen B?
- Example 2 :
A shopping site has two designs for its purchase screen and conducted an A/B test to see which design is preferred. This shopping site has a fixed set of users, all of them existing users. The average purchase amounts for Screen A and Screen B were determined at the end of the test period. Is it correct to perform a statistical hypothesis test on the purchase amounts for Screen A and Screen B?
Answer
- Example 1 :
Correct. This is a meaningful test. The visitors to Screen A and the visitors to Screen B during the test period can each be considered a sample drawn from a larger population: all visitors to that screen during the test period and in subsequent periods. A test can therefore be performed to judge whether the conversion rates differ significantly in that wider population, including after the test period.
- Example 2 :
Incorrect. This is a meaningless test. Since the set of site users does not change, the users who visited Screen A and the users who visited Screen B during the test period are themselves the populations. Therefore, it is meaningless to perform this type of test. The test procedure can still be applied mechanically, and it will formally report either a significant difference or no significant difference, but the conclusion has no substantive meaning.
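For a case like Example 1, where the observed visitors really are a sample, one common choice is a two-proportion z-test on the conversion rates. The sketch below is only an illustration under stated assumptions: the visitor and conversion counts are hypothetical, the function name two_proportion_z_test is made up for this article, and the normal approximation it relies on assumes reasonably large samples. It is not necessarily the test a given report would use.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z, p_value). Uses the pooled proportion under the null
    hypothesis that both populations have the same conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: conversions and visitors for each screen.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The same function can be run on the data of Example 2 and will return a p-value there too, which is exactly the trap: the arithmetic works, but without a population behind the data the result has nothing to refer to.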
Sample survey and exhaustive survey
As shown in Figure 3, a sample survey is a survey in which a portion of the population is selected as a sample and the characteristics of that sample are examined. On the other hand, an exhaustive survey is a survey in which the entire population is surveyed, and data is collected and analyzed. For example, the national census is an exhaustive survey, and in many cases employee satisfaction surveys are almost exhaustive. Customer satisfaction surveys, by contrast, often have to be sample surveys.
This applies beyond surveys. For example, suppose you run a shopping site with 1 million users and want to analyze the average amount spent by those users, grouped in some way. In this case, the analysis covers the data of all users, and there is no larger population of users behind the data. Therefore, it makes no sense to conduct a statistical test, and running one is itself a misuse of the test.
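When the data already covers every user, the comparison reduces to plain arithmetic. The following sketch uses a tiny hypothetical table that stands in for the complete user base (the group labels and amounts are invented for illustration). The difference between the group averages is computed directly and is already the exact answer for these users, so there is no sampling error left for a test to account for.

```python
import statistics

# Hypothetical COMPLETE user table: (group, amount spent) for every user.
all_users = [
    ("new", 40.0),
    ("new", 60.0),
    ("returning", 80.0),
    ("returning", 100.0),
    ("returning", 120.0),
]

# Collect the amounts for each group.
amounts_by_group = {}
for group, amount in all_users:
    amounts_by_group.setdefault(group, []).append(amount)

# Averages over all users in each group: these are population values,
# not estimates, because nobody has been left out.
group_means = {g: statistics.mean(v) for g, v in amounts_by_group.items()}
difference = group_means["returning"] - group_means["new"]

print(group_means)
print(f"difference in average spend: {difference}")
```

Whether a difference of this size matters is then a business question about its magnitude, not a question of statistical significance.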
Attractiveness of “Significant Difference”
I think one reason statistical tests tend to be misused is the appeal of the term “significant difference.” When an analyst conducts a statistical hypothesis test and says, “We found a statistically significant difference,” some people might take this to mean that a meaningful finding has been made. I suspect that a few analysts know the test is meaningless but use it anyway to make their argument more persuasive, hoping to be perceived as having obtained meaningful results.
I hope that the data scientists who read this article will be better able to judge whether a statistical test is meaningful. I also hope that everyone who receives a data scientist’s report will be able to tell whether a test is meaningful.
Summary
- Statistical hypothesis testing requires a population
- Statistical hypothesis testing for an exhaustive survey does not make sense