Multicollinearity is a term we often come across when we’re working with multiple regression models. We have even talked about it in our previous posts, but do we know what it actually means? Today, we’ll try to understand that.
In most real-life problems, we usually have multiple features to work with, and not all of them are in the format that we, or the model, want. For example, a lot of categorical features are usually in text format. But as we already know, our models require the features to be numerical. For this, we label encode such features and, if required, one hot encode them.
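Here is a minimal sketch of both encodings using pandas. The toy data and the column names (`city`, `rooms`) are made up for illustration, and this is just one of several ways to do it.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for this example.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "rooms": [2, 3, 1, 2],
})

# Label encoding: map each category to an integer code.
# Categories are sorted, so Delhi -> 0, Mumbai -> 1, Pune -> 2.
df["city_label"] = df["city"].astype("category").cat.codes

# One hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

encoded = pd.concat([df[["rooms", "city_label"]], one_hot], axis=1)
print(encoded)
```

One thing worth noticing here: the one hot columns always add up to 1 in every row, so any one of them can be worked out from the others. That is why `pd.get_dummies` offers a `drop_first=True` option, which drops one of the columns.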
But in some cases, we might have features whose values can be easily determined by the values of other features. In other words, we can see a very strong correlation between the values of features. The correlation could be so high that you could predict the values of one feature based on the values of the others. This is multicollinearity: the input carries redundant information. In a regression model, this makes the coefficient estimates unstable and hard to interpret, because the model can’t tell which of the correlated features is responsible for an effect, and that can lead to misleading conclusions from the model.
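To make the idea of a redundant feature concrete, here is a small sketch with NumPy. The data is randomly generated and the feature names (`height_cm`, `height_in`, `weight_kg`) are invented for the example. The second feature is just the first one in different units, so it adds no new information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up features for illustration only.
height_cm = rng.normal(170, 10, size=100)
height_in = height_cm / 2.54               # fully determined by height_cm
weight_kg = rng.normal(70, 8, size=100)    # generated independently

X = np.column_stack([height_cm, height_in, weight_kg])

# The matrix has 3 columns, but the rank counts only the
# linearly independent ones.
rank = np.linalg.matrix_rank(X)
r = np.corrcoef(height_cm, height_in)[0, 1]

print("columns:", X.shape[1], "rank:", rank)
print("correlation between the two height features:", r)
```

The rank comes out lower than the number of columns because one column can be reconstructed from another. In real data the relationship is rarely this exact, but the closer it gets, the harder it is for a regression to separate the effects of the two features.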
How do we detect correlation between features? Well, by calculating the correlation coefficient for each pair of features. The correlation coefficient measures the strength and direction of the linear relationship between two features. One of the most commonly used is the Pearson correlation coefficient, which is also known as the Pearson Product Moment Correlation (PPMC). It takes a value in the range of -1 to 1. A value of -1 means a perfect negative linear relationship: as one feature increases, the other decreases along a straight line. A value of 1 means a perfect positive linear relationship. A value of 0 means there’s no linear correlation, although the features could still be related in a non-linear way. So our goal is to find sets of features whose pairwise correlation coefficients are as close to 0 as possible.
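As a rough sketch, this is how the pairwise Pearson coefficients could be computed with pandas. The data is synthetic, the feature names (`area`, `rooms`, `age`) are invented, and the cutoff of 0.8 is an arbitrary assumption, not a rule. The small `pearson` helper is a hypothetical function written here only to show what the coefficient calculates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic features for illustration only.
area = rng.normal(1000, 200, size=n)
rooms = area / 250 + rng.normal(0, 0.3, size=n)  # strongly tied to area
age = rng.normal(20, 5, size=n)                  # generated independently

features = pd.DataFrame({"area": area, "rooms": rooms, "age": age})

# Pairwise Pearson correlation matrix.
corr = features.corr(method="pearson")
print(corr.round(3))


def pearson(x, y):
    """Pearson coefficient computed by hand from centered values."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())


# Flag feature pairs whose absolute correlation exceeds the cutoff.
threshold = 0.8  # arbitrary cutoff chosen for this example
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Only the upper triangle of the matrix is scanned, since the diagonal is always 1 and the lower triangle repeats the same values.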
But, as with any other method, the Pearson correlation method comes with a drawback. It can’t differentiate between an independent variable (or a feature) and a dependent variable, because the coefficient is symmetric. For example, if you’re trying to find the correlation between a high calorie diet and diabetes, you might get a coefficient somewhere north of 0.5. But if you switch the two, you’ll get exactly the same value. Read naively, this would mean diabetes causes a high calorie diet, and that doesn’t make any sense. The coefficient only tells you that two variables move together, not which one drives the other. So if you’re using the Pearson correlation method, you should be wary of what features you select and how you interpret the result.
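A quick way to see this symmetry for yourself is to compute the coefficient in both directions. The numbers below are randomly generated and the variable names (`calories`, `risk_score`) are invented; they are not real health data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data for illustration only, not real measurements.
calories = rng.normal(2500, 400, size=300)
risk_score = 0.002 * calories + rng.normal(0, 1, size=300)

# Same pair of variables, with the order switched.
r_xy = np.corrcoef(calories, risk_score)[0, 1]
r_yx = np.corrcoef(risk_score, calories)[0, 1]

print(r_xy, r_yx)
print("same value either way:", np.isclose(r_xy, r_yx))
```

Nothing in the calculation knows which variable was meant to be the cause, so the direction has to come from your understanding of the problem and not from the coefficient.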