Now that we know what feature selection is and how to do it, let’s move our focus to validating the efficiency of our model. This is known as validation or cross validation, depending on the kind of validation method you’re using. But before that, let’s try to understand why we need to validate our models.
Validation, or Evaluation of Residuals
Once you have fitted your model to your training data, and you’ve also tested it with your test data, you can’t just assume that it’s going to work well on data that it has not seen before. In other words, you can’t be sure that the model will have the desired accuracy and variance in your production environment. You need some kind of assurance about the accuracy of the predictions that your model is putting out. For this, we need to validate our model. After training your model, an error estimation of the model has to be made, and this is known as evaluation of residuals.
This process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data is known as validation.
Validation will give us a numerical estimate of the difference between the values the model predicts and the actual values in our dataset. But the problem is, this only gives us an idea of how the model is performing with the data that was used to train it. This doesn’t tell us anything about unseen data. The model could be either overfitting or underfitting the data. We need to get an idea of how the model will perform with new data. And for that, we need Cross Validation.
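To make the idea of evaluating residuals a bit more concrete, here is a minimal sketch that computes the residuals and the mean squared error for a handful of values. The numbers and the function names are made up for illustration and are not tied to any particular model or library.

```python
def residuals(actual, predicted):
    """Return the per-point differences between actual and predicted values."""
    return [a - p for a, p in zip(actual, predicted)]


def mean_squared_error(actual, predicted):
    """Average of the squared residuals; lower means a closer fit."""
    res = residuals(actual, predicted)
    return sum(r * r for r in res) / len(res)


# Hypothetical target values and model predictions on the training data.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 10.0]

print(residuals(actual, predicted))
print(mean_squared_error(actual, predicted))
```

Mean squared error is only one possible summary of the residuals. Which error measure fits best depends on the problem.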
Cross Validation
There are different types of cross validation methods, and they can be classified into two broad categories: Non-exhaustive and Exhaustive Methods. We’re going to look at a few examples from both categories.
Non-Exhaustive Methods
Holdout Method
We now know that we split our entire dataset into two sets, a training set and a test set. For cross validation, we’ll be splitting the training set again into two sets: one will remain the training set, and the other will be known as the validation set. As you may have guessed by now, we’ll be using the validation set for cross validating our model.
The holdout method is the simplest of the cross validation methods. In this method, as already discussed, we split our training set and take out a small part as the validation set. We train our model with the new and smaller training set, and validate the accuracy of the model on the validation set, which is still unseen by the model. But there’s a problem here. Because we don’t control which data ends up in the training set and which ends up in the validation set, the error estimate can have high variance. For different pairs of training and validation sets, we might end up with different results. To reduce this, we’ll be using a variation of the holdout method, which we’ll see next.
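Here is a rough sketch of a holdout split in plain Python. The function name holdout_split, the 20% validation fraction and the seeds are assumptions chosen for illustration. The list of integers simply stands in for real training examples.

```python
import random


def holdout_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split off a validation set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # Everything after the first n_validation items stays in the training set.
    return shuffled[n_validation:], shuffled[:n_validation]


training_set = list(range(20))  # stand-in for 20 training examples

# Different seeds give different splits, which is where the variance comes from.
for seed in (0, 1):
    train, validation = holdout_split(training_set, seed=seed)
    print(seed, sorted(validation))
```

Running this with a few different seeds shows how much the validation set can change from one split to the next.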
K-Fold Cross Validation
When we split our training dataset to get a validation set, there’s always a risk of losing some crucial data from the training set, or of losing patterns which might then go unnoticed by the model. This will in turn lead to overfitting or underfitting. To avoid this, we need enough data in both the training set and the validation set. And for this, we use K-Fold Cross Validation.
In this method, the original training set is divided into k subsets. The holdout method is now repeated k times with different datasets. In each fold, one of the k subsets is taken as the validation set, and the remaining k – 1 subsets are used as the training set. The error estimations from all the folds are taken and averaged to give us the final error estimation of the model.
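The procedure can be sketched in a few lines of plain Python. To keep the example self-contained, the "model" here is an assumption: it simply predicts the mean of the training targets. The helper names and the target values are also hypothetical. In practice you would usually shuffle the data before splitting it into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def mean(values):
    return sum(values) / len(values)


def k_fold_error(targets, k):
    """Average validation MSE of a predict-the-training-mean model over k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(targets), k):
        prediction = mean([targets[i] for i in train])  # "fit" on k - 1 subsets
        errors = [(targets[i] - prediction) ** 2 for i in validation]
        fold_errors.append(mean(errors))  # error estimate for this fold
    return mean(fold_errors)  # final estimate: average across folds


targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical target values
print(k_fold_error(targets, k=3))
```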
Because we’re using all the k sets for validation, each data point appears in the validation set exactly once, and in the training set exactly k – 1 times. This greatly improves the reliability of the accuracy estimate for the model. It reduces bias because most of the data is used for fitting in every fold, and it reduces variance because every data point is eventually used for validation and the errors are averaged across folds. And because we’re rotating the data in each fold, the final estimate depends much less on how any single split happened to fall.
Depending on your dataset, you can select a k value of your own. But in most cases, k = 5 or k = 10 is preferred. As with most things in machine learning, nothing is written in stone and it completely depends on your data.
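We can check the "exactly once" and "exactly k – 1 times" claims with a small counting sketch. The sample count, the value of k and the way the folds are built (taking every k-th index) are arbitrary choices for illustration.

```python
from collections import Counter

n_samples, k = 10, 5
indices = list(range(n_samples))
folds = [indices[i::k] for i in range(k)]  # k disjoint subsets of the indices

validation_counts = Counter()
training_counts = Counter()
for validation in folds:
    # Everything outside the current fold is used for training.
    training = [i for i in indices if i not in validation]
    validation_counts.update(validation)
    training_counts.update(training)

print(all(validation_counts[i] == 1 for i in indices))
print(all(training_counts[i] == k - 1 for i in indices))
```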
Stratified K-Fold Cross Validation
In some cases, the data might not be divided evenly between the training and validation sets. For example, in a classification problem, one validation set may contain a large number of negative outcomes, while another contains mostly positive ones. This will again lead to bias and high variance in the outcome.
To avoid this, we make a slight modification to the K-Fold Cross Validation method, such that each fold contains roughly the same proportion of each category as the full dataset. In the case of continuous values, we make sure the means of the outcomes are comparable across folds. This variation of K-Fold is known as Stratified K-Fold Cross Validation.
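One simple way to picture stratification for a classification problem is to group the samples by class and deal each class out to the folds in turn. The sketch below does exactly that. The function name stratified_folds and the labels are made up, and this round-robin scheme is only one possible approach, not necessarily how any particular library implements it.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 8 negatives (0) and 4 positives (1).
labels = [0] * 8 + [1] * 4
for fold in stratified_folds(labels, k=4):
    print(Counter(labels[i] for i in fold))
```

With these labels, every fold ends up with the same 2-to-1 ratio of negatives to positives as the full dataset.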
Exhaustive Methods
We’ll look at one of the most commonly used exhaustive methods, which is Leave-P-Out Cross Validation.
Leave-P-Out Cross Validation
In this method, if there are n data points, n – p data points are used for training in one iteration and the remaining p data points are used for validation. This is repeated for every possible way of choosing p points from the original dataset. The errors from all these iterations are averaged to get the final efficiency figure.
This is classified as an exhaustive method because the model has to be trained for every possible combination, and there are C(n, p) = n! / (p! (n – p)!) of them. Even a moderate value of p makes this very expensive, since the number of combinations grows quickly as p increases towards n/2.
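To get a feel for how quickly the work piles up, here is a small sketch that enumerates the leave-p-out splits using the standard library. The helper name leave_p_out_splits and the sample sizes are assumptions for illustration. math.comb needs Python 3.8 or newer.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_samples, p):
    """Yield (train_indices, validation_indices) for every choice of p points."""
    indices = range(n_samples)
    for validation in combinations(indices, p):
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, list(validation)


n_samples, p = 5, 2
splits = list(leave_p_out_splits(n_samples, p))
print(len(splits), comb(n_samples, p))  # both count the same combinations

# The number of models to train grows quickly with n, even for small p.
for n in (10, 50, 100):
    print(n, comb(n, 2), comb(n, 5))
```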
Leave-One-Out Cross Validation
This is a variation of the Leave-P-Out cross validation method, where the value of p is 1. This is much less expensive, because the number of possible combinations is just n, where n is the number of data points.
As you can see, cross validation really helps in evaluating the effectiveness of the model you’re designing. Cross validating your model is good practice, compared to using your model straight after training it without any idea of how it will perform with unseen data.
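Leave-one-out is simple enough to write as a single loop. As before, the "model" in this sketch is an assumption made to keep things short: it predicts the mean of the remaining points. The target values are made up.

```python
def leave_one_out_error(targets):
    """Average squared error when each point is predicted from all the others."""
    errors = []
    for held_out in range(len(targets)):
        train = targets[:held_out] + targets[held_out + 1:]
        prediction = sum(train) / len(train)  # "fit" on the other n - 1 points
        errors.append((targets[held_out] - prediction) ** 2)
    return sum(errors) / len(errors), len(errors)


targets = [1.0, 2.0, 3.0, 4.0]  # hypothetical target values
error, n_iterations = leave_one_out_error(targets)
print(n_iterations)  # one iteration per data point
print(error)
```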
The good news for applied-sciences engineers is that libraries such as scikit-learn provide all the tools required to perform crazy operations like this on our models. We’ll be talking about this very soon.