In our previous post, we saw how to perform Backward Elimination as a feature selection algorithm to weed out insignificant features from our dataset. In this post, we’ll check out the next method for feature selection, which is Forward Selection. As you can probably guess, this is roughly the opposite of backward elimination: instead of starting with every feature and removing the weak ones, we start with no features and add the strong ones one at a time. But before that, make sure you’re familiar with the concept of the P-value.

Similar to backward elimination, we have a few steps to follow here. We’ll go one by one as usual. But before going in, you need to know that this is going to be a more tedious job than backward elimination, because you have to create a lot of linear regression models. With k features, the first round alone needs k models, the next round k - 1, and so on, so the number of models grows quickly as the number of features in your dataset goes up. With that in mind, let’s get started.

## Step 1

The first step is very similar to that of backward elimination. Here, we select a significance level, which is the threshold we’ll compare P-values against. And as you already know, a significance level of 5%, meaning a P-value cutoff of 0.05, is common. So let’s stick with that.

## Step 2

This is a pretty tedious step. In this second step, we create a simple linear regression model for each feature we have in our dataset. So if there are 100 features, we create 100 simple linear regression models, each with exactly one feature. This can get boring and complicated depending on the number of features in your dataset, but it is also one of the most important steps in the process. Once we fit all the simple linear regression models, we calculate the P-value of the feature in each of them and identify the feature with the **lowest** P-value.
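As a rough sketch of this step, the snippet below fits one simple regression per feature with `scipy.stats.linregress` and picks the feature whose slope has the lowest P-value. The synthetic dataset and the variable names (`X`, `y`, `p_values`, `best`) are assumptions made up for illustration; they are not from the original post.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): only features 0 and 1 drive y,
# features 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

# One simple linear regression per feature; keep the P-value of each slope.
p_values = [stats.linregress(X[:, j], y).pvalue for j in range(X.shape[1])]

# The feature with the lowest P-value is the first one we keep.
best = int(np.argmin(p_values))
print("P-values:", p_values)
print("Feature with the lowest P-value:", best)
```

With real data you would loop over the columns of your own feature matrix in the same way.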

## Step 3

In the previous step, we identified the feature with the lowest P-value. We keep that feature and pair it with each of the remaining features in turn. So in the second step, we had simple regression models with one feature each. In this step, we’ll have one fewer model, but each of them will have two features: the feature we just selected plus one candidate feature. Once we do this, we fit the models again and, in each model, look at the P-value of the newly added candidate feature.
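To make it concrete which P-value we read off in this step, here is a minimal sketch. It assumes feature 0 was the winner of step 2, and it uses a small hypothetical helper, `coef_p_values`, that computes the usual two-sided t-test P-values for an ordinary least squares fit with NumPy and SciPy. The data and all names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def coef_p_values(X, y):
    """Two-sided t-test P-values of the OLS coefficients (intercept excluded)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares fit
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), dof)[1:]

# Synthetic data (an assumption for illustration): only features 0 and 1 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

selected = 0  # assumed winner of step 2
candidate_p = {}
for j in range(X.shape[1]):
    if j == selected:
        continue
    # Two-feature model: the selected feature plus candidate j.
    p = coef_p_values(X[:, [selected, j]], y)
    candidate_p[j] = p[1]  # P-value of the newly added candidate j

best = min(candidate_p, key=candidate_p.get)
print("Candidate P-values:", candidate_p)
print("Best candidate to add next:", best)
```

Note that only the candidate's own P-value is stored for the comparison, even though every coefficient in the two-feature model gets a P-value of its own.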

## Step 4

In this step, we have the P-value of the newly added feature in each of the models we created in the previous step. We identify the feature with the lowest P-value again and check whether this lowest P-value is less than the significance level, or 0.05 in our example. If so, we keep that new feature as well and add it to the models of all the remaining features. So basically, we’re repeating step 3 with one more feature in the base model. We continue this loop until the lowest P-value we get is no longer less than the significance level. Once we reach this stage, we break the loop.

Once we break this loop, we have the model we want, which is the model from the iteration before the one that broke the loop. Let me explain that. Suppose the loop ran for 10 iterations, and in the 10th iteration we found that the lowest P-value was more than the significance level. We then take the model from the 9th iteration. We don’t use the model from the 10th iteration because the feature added in it is not statistically significant, as its P-value was more than 0.05.
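Putting the steps together, the whole procedure can be sketched as a single loop. This is one possible reading of the steps above, not a definitive implementation: the function name `forward_selection`, the helper `coef_p_values`, and the synthetic data are all assumptions for illustration. The sketch also applies the 0.05 check in the very first round, which the steps above do not spell out.

```python
import numpy as np
from scipy import stats

def coef_p_values(X, y):
    """Two-sided t-test P-values of the OLS coefficients (intercept excluded)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares fit
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), dof)[1:]

def forward_selection(X, y, alpha=0.05):
    """Return the column indices chosen by forward selection, in order."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # P-value of each candidate when added to the features kept so far.
        trial = {j: coef_p_values(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= alpha:
            break  # best candidate is not significant: keep the previous model
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (an assumption for illustration): only features 0 and 1 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

selected = forward_selection(X, y)
print("Selected features, in the order they were added:", selected)
```

Breaking out of the loop before appending the failing candidate is what gives us "the model from the previous iteration" described above.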

Anyway, you now have the model you’re looking for. The only problem with this forward selection method is the number of iterations and the number of models you end up building, which can easily become difficult to maintain and monitor. But it is a necessary part of the process. I hope I was clear enough in the explanation. Please let me know in the comments below if anything was missing from this, or if you need me to do more explaining.

Hello Sunny, My name is Shubham and I’m very new to this ML field. I recently came across your blog when I was browsing the “towardsdatascience” website. Until now, I have read three topics from your blog, namely feature selection, backward selection, and forward selection. I find your explanation very easy and thorough, which is very helpful to someone new like me who has to make my own notes.

I’d just like to add that step 3 of your forward selection article needs more clarity. Maybe I’m the only one who feels that, but it doesn’t explain clearly what we calculate after adding a feature. We have to calculate the new p-value, but the p-value of what? Of all the models? Of the combination of features? Of the most recently added feature?

This comment is not meant to criticize but as an act of encouragement for the writer.

Also, does adding a new feature to our model change the p-values of all the other features as well? Or do the p-values always remain the same?