Multicollinearity understood with the simplest example

A simple visual example of multicollinearity

Shiro Matsumoto
4 min read · Oct 3, 2023

What is multicollinearity?
In regression analysis, multicollinearity is a phenomenon in which two or more independent variables are highly correlated. There are already many good articles on Medium that explain it in detail in text. Here, I will use the simplest example I can to help you understand multicollinearity.

Let us consider a simple example with two independent variables, one dependent variable, and a sample size of 5.

Sample data (by author)

These data could be the price of a house (y) given the area of the land (x1) and the total floor area (x2), or the cost of developing a system (y) given the number of screens (x1) and the number of forms (x2). In any case, let us assume a situation where we want to explain y using x1 and x2.

Let’s first see whether we can explain y using x1 alone. At the same time, let’s see whether we can explain y using x2 alone. In other words, single regression analysis.

Figure 1 Single regression analysis on sample data (by author)

Just to be sure: the diagonal solid line is the regression line obtained from the regression analysis; the length of each dotted line parallel to the y-axis is the difference between the regression line and the sample point; and the sum of the squares of those lengths is the residual sum of squares. Put the other way around, the regression line is the straight line drawn so that this residual sum of squares is smallest.

Since y increases as x1 increases, the attempt to predict y using x1 seems to work. Similarly, using x2 to predict y seems to work well.
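Here is a minimal sketch of the two single regressions in Python. The five (x1, x2, y) rows are hypothetical stand-ins for the sample table above (the article's actual values are not reproduced here), chosen only so that x1 and x2 are highly correlated:

```python
import numpy as np

# Hypothetical sample, n = 5 (not the article's actual data).
x1 = np.array([100, 120, 150, 180, 200], dtype=float)  # e.g. land area
x2 = np.array([ 90, 130, 140, 170, 210], dtype=float)  # e.g. total floor area
y  = np.array([310, 320, 415, 475, 480], dtype=float)  # e.g. price

for name, x in [("x1", x1), ("x2", x2)]:
    slope, intercept = np.polyfit(x, y, deg=1)        # least-squares line
    rss = np.sum((y - (slope * x + intercept)) ** 2)  # residual sum of squares
    print(f"y ~ {name}: slope = {slope:.3f}, intercept = {intercept:.3f}, RSS = {rss:.1f}")
```

Both slopes come out positive for this sample: each variable on its own predicts y in the expected direction, just as in the figure.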

However, x2 contains information that does not appear in x1, and x1 contains information that is not in x2. In the property example, a property is likely to be more expensive if the total floor area is larger even when the land area is the same, and likewise more expensive if the land area is larger when the floor area is the same. Conversely, two properties with the same price might be a small building on a large lot or a large building on a smaller lot. It therefore seems that we could explain y better by using x1 and x2 at the same time.

This is where multiple regression analysis comes in. So let’s get started. This time the plot must be three-dimensional to show the relationship among the three variables: two explanatory variables and one objective variable.

Figure 2 Multiple regression analysis of sample data (by the author)

Compared to the single regression results, the residual sum of squares is smaller at 11678, so the fit appears to have less error. (In the single regression analyses it was 12011 and 28752, respectively.)

However, the regression equation obtained is a bit strange.

Regression equation: y = 1.265 x1 - 0.153 x2 + 136.637

This equation says that y increases as x1 increases but decreases as x2 increases. That contradicts the single regression results we obtained earlier and, in the property-price example, goes against intuition. Why does this happen? This is an example of multicollinearity. In fact, the correlation coefficient between x1 and x2 is as high as 0.90. Such strong correlation among explanatory variables is likely to produce multicollinearity, which can yield results that run contrary to the expected trend and are difficult to interpret.
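The same computation in code, continuing the hypothetical sample from the earlier sketch (so the numbers will not match the article's 1.265, -0.153, and 0.90, but the sign flip is the same):

```python
import numpy as np

# Same hypothetical five-row sample as before (not the article's data).
x1 = np.array([100, 120, 150, 180, 200], dtype=float)
x2 = np.array([ 90, 130, 140, 170, 210], dtype=float)
y  = np.array([310, 320, 415, 475, 480], dtype=float)

X = np.column_stack([x1, x2, np.ones_like(x1)])  # columns: x1, x2, intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit
rss = np.sum((y - X @ coef) ** 2)
print(f"y = {coef[0]:.3f} x1 + {coef[1]:.3f} x2 + {coef[2]:.3f}  (RSS = {rss:.1f})")

# The warning sign: the predictors themselves are highly correlated.
print("corr(x1, x2) =", round(np.corrcoef(x1, x2)[0, 1], 3))
```

With this sample the x2 coefficient comes out negative even though y rises with x2 in the single regression, and corr(x1, x2) is about 0.97.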

Since there were only two explanatory variables here, it was easy to check whether multicollinearity was occurring. As the number of explanatory variables increases, however, checking becomes increasingly difficult. When multicollinearity is found to occur, there are several ways to alleviate it.
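One standard diagnostic that scales beyond two variables is the variance inflation factor (VIF): regress each predictor on all the others and see how much of it they explain. This is a generic sketch (the `vif` helper is my own, not from the article); a common rule of thumb flags values above about 5 to 10:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X (predictor columns only, no intercept)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on all remaining columns plus an intercept.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

x1 = np.array([100, 120, 150, 180, 200], dtype=float)
x2 = np.array([ 90, 130, 140, 170, 210], dtype=float)
print(vif(np.column_stack([x1, x2])))  # both well above 10 for this sample
```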

  • Although this article showed an obvious example in which the sign of a regression coefficient changes between single and multiple regression analysis, not all multicollinearity reverses the sign.
  • The effect of multicollinearity often weakens as the sample size increases, but it does not disappear (see the simulation sketch below).
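The second point can be checked with a small simulation. This sketch (my own construction, not from the article) draws samples of increasing size with corr(x1, x2) near 0.9 versus 0, fits the multiple regression, and compares how much the estimated x1 coefficient scatters around its true value:

```python
import numpy as np

rng = np.random.default_rng(0)

def coef_spread(n: int, rho: float, trials: int = 2000) -> float:
    """Std. dev. of the estimated x1 coefficient across simulated samples."""
    betas = []
    for _ in range(trials):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr(x1, x2) ~ rho
        y = x1 + x2 + rng.normal(size=n)                          # true coefficients: 1, 1
        X = np.column_stack([x1, x2, np.ones(n)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        betas.append(beta[0])
    return float(np.std(betas))

for n in (10, 100, 1000):
    print(f"n={n}: spread with rho=0.9: {coef_spread(n, 0.9):.3f}, "
          f"with rho=0.0: {coef_spread(n, 0.0):.3f}")
```

The spread shrinks as n grows in both cases, but the correlated case stays roughly 2.3 times noisier at every sample size: more data weakens the symptom without removing the underlying inflation.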

The following articles may be helpful in identifying whether multicollinearity exists and how to mitigate it if it does.

References

  • detect multicollinearity
  • mitigate multicollinearity

There are several ways to mitigate the effects of multicollinearity, including variable selection, principal component analysis, and regularization terms as in lasso regression and ridge regression. I couldn’t find one article explaining all of these together, so I have listed articles above that seem good individually.
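As one concrete illustration, here is ridge regression applied to the same hypothetical five-row sample used above, via scikit-learn (the alpha value is an arbitrary choice for this sketch, not something tuned or taken from the article). The L2 penalty shrinks and stabilizes the coefficients, pulling the flipped sign back in line with the single regression results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

x1 = np.array([100, 120, 150, 180, 200], dtype=float)
x2 = np.array([ 90, 130, 140, 170, 210], dtype=float)
y  = np.array([310, 320, 415, 475, 480], dtype=float)

# Standardize predictors first; the ridge penalty is scale-sensitive.
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)  # x2 comes out negative
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)    # both positive
```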

I’m glad to receive your comments.
