If you’ve worked with a dataset that has more than one feature in your machine learning endeavors, you’ve probably also heard of a concept called feature selection. Today, we’re going to find out what it is and why we need it.
When a dataset has too many features, it is rarely ideal to include all of them in our machine learning model. Some features may be irrelevant to the dependent variable, that is, the target we are trying to predict. For example, suppose you are going to predict how much it would cost to crush a car, and the features you’re given are:
- the dimensions of the car
- if the car will be delivered to the crusher or the company has to go pick it up
- if the car has any fuel in the tank
- the color of the car
You can safely assume that the color of the car is not going to influence the cost of crushing it (at least I hope so). So it doesn’t make sense to include that feature and make the model more complex than it needs to be. It would be wise to eliminate this feature completely.
So, in essence, we use feature selection to remove unnecessary, irrelevant, or redundant features from the dataset. Such features do not help improve the accuracy of the model, and they might actually reduce it.
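To make the idea concrete, here is a minimal sketch of one possible way to spot an irrelevant feature: measure how strongly each feature is correlated with the target and drop the ones that barely move with it. Everything in it is an assumption made for illustration: the numbers are made up, `color_code` treats color as a plain number (a simplification), and the `0.3` cutoff is arbitrary. It is not a recommended recipe, just one way to picture what "removing an irrelevant feature" means.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data: one list per feature, one entry per car.
features = {
    "length_m": [3.5, 4.0, 4.5, 5.0, 5.5, 6.0],
    "color_code": [0, 1, 2, 2, 1, 0],  # hypothetical numeric encoding of color
}
crushing_cost = [72, 79, 91, 99, 111, 120]  # made-up target values

THRESHOLD = 0.3  # arbitrary cutoff chosen for this illustration

scores = {name: pearson(values, crushing_cost) for name, values in features.items()}
selected = [name for name, r in scores.items() if abs(r) >= THRESHOLD]

print(scores)
print(selected)
```

With this toy data the car length tracks the cost closely while the color code does not, so only `length_m` survives the cut.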
In their paper, “An Introduction to Variable and Feature Selection,” Guyon and Elisseeff write:
“The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.”
Feature selection is also known as variable selection and attribute selection. It is often confused with dimensionality reduction. It’s true that both reduce the number of features in a dataset, but they differ in how they do it. Dimensionality reduction creates new features as combinations of the existing ones, so all the original features are still present in a way, but the total number of features is smaller. In feature selection, we either retain a feature as it is or remove it completely from the dataset.
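The difference can be sketched in a few lines of plain Python. The car data, the feature names, and the helper functions `select_features` and `combine_features` below are all hypothetical, and the weights are hand-picked for illustration only; a real dimensionality reduction technique would derive them from the data.

```python
# Made-up toy data: each row is one car.
feature_names = ["length_m", "delivered", "has_fuel", "color_code"]
cars = [
    [4.2, 1, 0, 2],
    [5.1, 0, 1, 0],
    [3.8, 1, 1, 1],
]

def select_features(rows, names, keep):
    """Feature selection: keep the chosen original columns unchanged, drop the rest."""
    indices = [names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]

def combine_features(rows, weights):
    """Dimensionality reduction (sketch): every new feature is a weighted
    sum of the original columns, one weight list per new feature."""
    return [
        [sum(w * x for w, x in zip(ws, row)) for ws in weights]
        for row in rows
    ]

# Selection: color is gone, the other three columns are untouched.
selected = select_features(cars, feature_names, ["length_m", "delivered", "has_fuel"])

# Reduction: four columns become two new ones that mix the originals.
# These weights are arbitrary placeholders, not learned from the data.
weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
combined = combine_features(cars, weights)

print(selected)
print(combined)
```

Notice that every value in `selected` can still be traced back to a single original column, while the values in `combined` are blends that no longer correspond to any one original feature.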
In the next few posts, we’ll see more about feature selection, including a few algorithms.