In our previous post, we discussed what feature selection is and why we need it. In this post, we’re going to look at the different methods used in feature selection. There are three main classes of feature selection methods – Filter Methods, Wrapper Methods, and Embedded Methods. We’ll look at each of them in turn.
Filter methods are learning-algorithm-agnostic, which means they can be employed no matter which learning algorithm you’re using. They’re generally used as data pre-processors. In filter methods, each individual feature in the dataset will be scored on its correlation with the dependent variable. A variety of statistical tests will be used to calculate this correlation score. Based on this score, it will be decided whether to retain a feature in the dataset or to remove it. There are a number of such statistical tests for this, and we’ll have a brief introduction to four of them:
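- Pearson’s Correlation – We’ll use this when the feature and the dependent variable are both continuous values. The value calculated will be in the -1 to +1 range, where values near -1 or +1 indicate a strong linear relationship and values near 0 indicate little or none.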
- Linear Discriminant Analysis (LDA) – We’ll use this when we have to find out if a linear combination of various continuous features will be able to differentiate between various categories.
- Analysis Of Variance (ANOVA) – This works in the opposite direction to LDA: the features are categorical and the dependent variable is continuous. Here we’ll test whether the means of the dependent variable are the same across the groups defined by one or more categorical features. This lets us understand the relationship between a group of categorical features and a continuous dependent variable.
- Chi-Square – This is used to find out the degree of correlation between a group of categorical features and a categorical dependent variable using their frequency distribution.
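As a rough illustration of how a filter method scores features, here’s a minimal Python sketch using Pearson’s correlation. The function names, the toy housing data, and the 0.5 cut-off are all assumptions made up for this example; in practice the threshold depends on your data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_by_correlation(features, target, threshold=0.5):
    """Score each feature against the target and keep those with |r| >= threshold."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return scores, kept

# Toy data, invented for illustration only.
features = {
    "size": [50, 60, 80, 100, 120],
    "age": [30, 5, 20, 10, 25],
    "rooms": [2, 2, 3, 4, 5],
}
price = [150, 180, 240, 310, 360]

scores, kept = filter_by_correlation(features, price, threshold=0.5)
print(kept)
```

Note that each feature is scored on its own, without training any model, which is what makes filter methods cheap and learning-algorithm-agnostic.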
These may sound like the simplest methods (and they are to some extent), but they come with some drawbacks as well. One important drawback is that they don’t handle multicollinearity.
Multicollinearity is a phenomenon we see in multiple regression models, where one feature can be linearly predicted with great accuracy from the other features in the dataset. This isn’t necessarily bad, and it doesn’t affect the model’s prediction accuracy much, but it makes the coefficient estimates unstable: a small change in the data can cause huge changes in the values of the coefficients, which makes them hard to interpret. So, if you’re going to use any filter method for your feature selection, you’d better make sure you’re handling multicollinearity as well.
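One simple way to spot the most obvious cases is to check how strongly the features correlate with each other. The sketch below is only a rough first check under stated assumptions: the helper names, the toy columns, and the 0.9 threshold are made up for illustration, and pairwise correlation only catches collinearity between two features, not a feature that is a combination of several others.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(features, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (a, col_a), (b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Toy data: the same area recorded in two units, plus an unrelated column.
features = {
    "size_sqm": [50, 60, 80, 100, 120],
    "size_sqft": [538, 646, 861, 1076, 1292],
    "age": [30, 5, 20, 10, 25],
}

flagged = collinear_pairs(features)
print(flagged)
```

For a flagged pair, a common choice is to keep only one of the two features before running the filter.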
In the wrapper methods, we start by selecting a subset of features and train our model with this feature subset. We’ll then check the accuracy of the model with the test dataset. If we’re satisfied with the predictions, well, we’re done. If not, we start adding more features to our subset, or start removing some features from it. As you can imagine, we may have to run quite a lot of iterations of the model to end up with a subset of features which gives us the desired level of accuracy. So these methods are computationally pretty expensive.
Similar to Filter Methods, we have a few options to decide which features have to be added to or removed from our subset. Let’s look at a few common methods.
- Forward Selection – As already discussed, this is an iterative algorithm. We’ll start off without any features in the model. In each iteration, we’ll add one feature (typically the one that improves the model the most) and train the model again. We’ll continue iterating until adding a new feature no longer significantly improves our model’s accuracy. Once we reach this stage, we’ll retain all the features that we already have in our subset and discard the other features.
- Backward Elimination – As you might have figured out by now, backward elimination is the opposite of forward selection. In this method, we start the iteration with all the features. In each iteration we remove one feature, typically the least useful one, and see if there’s a significant drop in the accuracy of the model. We’ll continue iterating until we reach a point where removing any further feature would noticeably hurt the model.
- Recursive Feature Elimination – This is a greedy procedure, closest in spirit to backward elimination, which tries to find the subset of features that gives the best performing model. In each iteration, the method trains a model, ranks the features by their importance to that model, and sets aside the worst performing feature (or features). For the next iteration, it trains a new model on the remaining features. This continues until all the features are exhausted or the desired number of features is left. The features are then ranked according to the order of their elimination.
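To make the wrapper idea concrete, here’s a minimal Python sketch of forward selection. Everything in it is an assumption for illustration: the function names, the `min_gain` stopping rule, the toy data, and the choice of a tiny nearest-centroid classifier as the model. Any model and any score where higher is better could be plugged in instead. The score is computed on a held-out validation split.

```python
def forward_selection(candidates, score_fn, min_gain=0.01):
    """Greedy forward selection.

    candidates: list of feature names
    score_fn:   callable taking a list of feature names and returning a
                validation score where higher is better
    min_gain:   smallest improvement that justifies adding a feature
    """
    selected = []
    best_score = score_fn(selected)  # baseline with no features
    remaining = list(candidates)
    while remaining:
        # Train and score one model per candidate extension of the subset.
        trials = [(score_fn(selected + [f]), f) for f in remaining]
        trial_score, trial_feature = max(trials, key=lambda t: t[0])
        if trial_score - best_score < min_gain:
            break  # no remaining feature helps enough, so stop
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score

def make_nearest_centroid_scorer(train, valid):
    """Build a score_fn: validation accuracy of a nearest-centroid classifier.

    train and valid are lists of (feature_dict, label) pairs.
    """
    labels = sorted({label for _, label in train})

    def score_fn(names):
        # "Training": one centroid per class over the chosen features.
        centroids = {}
        for label in labels:
            rows = [x for x, y in train if y == label]
            centroids[label] = {n: sum(r[n] for r in rows) / len(rows) for n in names}
        correct = 0
        for x, y in valid:
            predicted = min(
                labels,
                key=lambda lab: sum((x[n] - centroids[lab][n]) ** 2 for n in names),
            )
            correct += predicted == y
        return correct / len(valid)

    return score_fn

# Toy data: "signal" separates the classes, "noise" does not.
train = [
    ({"signal": 1.0, "noise": 5.0}, 0),
    ({"signal": 1.2, "noise": 1.0}, 0),
    ({"signal": 3.0, "noise": 4.8}, 1),
    ({"signal": 3.3, "noise": 1.1}, 1),
]
valid = [
    ({"signal": 0.9, "noise": 1.2}, 0),
    ({"signal": 1.1, "noise": 4.9}, 0),
    ({"signal": 3.1, "noise": 1.0}, 1),
    ({"signal": 2.9, "noise": 5.1}, 1),
]

score_fn = make_nearest_centroid_scorer(train, valid)
selected, best_score = forward_selection(["signal", "noise"], score_fn)
print(selected, best_score)
```

Notice that the model is retrained for every candidate feature in every iteration, which is where the computational cost of wrapper methods comes from. Backward elimination follows the same pattern, starting from the full set and removing one feature at a time.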
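Embedded methods are a hybrid of the filter and wrapper methods. Here, feature selection is built into the training of the learning algorithm itself, so features are selected while the model is being fit. This lets us take the model’s performance into account, as wrapper methods do, without retraining the model for every candidate subset.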
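A couple of commonly cited examples are LASSO regression and Ridge regression. Both add a penalty on the size of the coefficients, which helps avoid over-fitting. LASSO’s L1 penalty can shrink some coefficients to exactly zero, which effectively removes those features from the model. Ridge’s L2 penalty shrinks coefficients towards zero but does not usually eliminate them, so on its own it reduces a feature’s influence rather than dropping the feature. We’ll be looking into these algorithms in more detail in the future.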
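To see how the two penalties behave differently, here’s a small Python sketch. It relies on a simplifying assumption: the feature columns are orthonormal, so each penalized coefficient has a closed form in terms of the ordinary least-squares coefficient. Under that assumption, minimizing ½‖y − Xβ‖² + λ‖β‖₁ (LASSO) soft-thresholds each coefficient, while minimizing ½‖y − Xβ‖² + (λ/2)‖β‖² (Ridge) divides each coefficient by 1 + λ. The function names and the numbers are made up for illustration.

```python
def lasso_shrink(ols_coefs, lam):
    """Soft-thresholding: LASSO coefficients when feature columns are orthonormal."""
    out = []
    for b in ols_coefs:
        magnitude = max(abs(b) - lam, 0.0)  # anything within lam of zero becomes 0
        out.append(magnitude if b >= 0 else -magnitude)
    return out

def ridge_shrink(ols_coefs, lam):
    """Proportional shrinkage: Ridge coefficients when feature columns are orthonormal."""
    return [b / (1.0 + lam) for b in ols_coefs]

# Hypothetical least-squares coefficients for three features.
ols = [3.0, 0.4, -1.5]

lasso = lasso_shrink(ols, lam=0.5)
ridge = ridge_shrink(ols, lam=0.5)
print(lasso)
print(ridge)
```

With these numbers, LASSO sets the second coefficient to exactly zero, which drops that feature, while Ridge makes every coefficient smaller but keeps all three non-zero. With correlated features there is no such closed form and the coefficients have to be found iteratively, but the qualitative difference remains.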