In most of our posts about machine learning, we’ve mentioned overfitting and underfitting. But many of us don’t yet know what those two terms mean. What does it actually mean when a model is overfit, or underfit? Why are they considered problems? And how do they affect the accuracy of our model’s predictions? These are some of the basic but important questions we need to ask and answer. So let’s discuss these two today.
The datasets we use for training and testing our models play a huge role in how well those models perform. It’s equally important to understand the data we’re working with. The quantity and the quality of the data also matter, obviously.
When there is too little data in the training phase, the model may fail to learn the patterns in the data, or fail to capture the correlation between the different features and the dependent variable. And if the data is not properly divided, we may introduce bias into our models. Now suppose we have a good amount of data for training. Once we finish training, we test the model with our test dataset and get an accuracy north of 95%. That’s awesome. So, assuming our model is ready, we start giving it unseen data. But to our shock, we see accuracy figures of only 50-60%. This is known as overfitting: the model has learned the training data too closely and fails to generalise beyond it.
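That gap between training-time accuracy and accuracy on fresh data is easy to reproduce. Here is a minimal sketch: we fit an unconstrained decision tree to a small dataset with deliberately noisy labels, so it memorises the training points but scores much lower on held-out data. The dataset, model, and parameters are illustrative choices, not from the original post.

```python
# Illustrative overfitting demo: an unconstrained tree memorises noisy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset; flip_y=0.2 flips 20% of labels to simulate noise.
X, y = make_classification(n_samples=200, n_features=20, n_informative=2,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# No depth limit, so the tree is free to memorise every training point.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # near-perfect: memorised the data
test_acc = tree.score(X_test, y_test)     # much lower: poor generalisation
```

The telltale signature is exactly the one described above: training accuracy near 100%, held-out accuracy far below it.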
In essence, if we see high bias in our predictions, we have an underfitting problem; if we see high variance, we have an overfitting problem. Overfitting happens because our model fails to generalise to data it has not seen before. We have to make sure we select our features properly so that we can remove any unwanted features that act as noise in the data. If there’s too much noise, the model might end up fitting the noise instead of the underlying signal, which leads to overfitting. Similarly, if there’s not enough data, or if the data is too uniform for any signal to stand out, the model might not be able to detect any kind of pattern or learn from the data. This leads to underfitting.
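Both failure modes can be seen side by side by fitting polynomials of different degrees to the same noisy data. In this sketch (all numbers are illustrative assumptions), a degree-1 fit underfits a cubic signal (high bias: both errors are high), while a degree-15 fit chases the noise (high variance: training error is tiny, error on fresh data is larger).

```python
# Illustrative bias/variance demo using NumPy polynomial fits.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 30))
y = x**3 - x + rng.normal(0, 0.1, 30)        # noisy cubic signal
x_new = np.sort(rng.uniform(-1, 1, 30))      # fresh draw from the same process
y_new = x_new**3 - x_new + rng.normal(0, 0.1, 30)

def fit_mse(degree):
    """Fit a polynomial on (x, y) and report train / fresh-data MSE."""
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    return train_mse, test_mse

underfit_train, underfit_test = fit_mse(1)   # high bias: too simple
overfit_train, overfit_test = fit_mse(15)    # high variance: fits the noise
good_train, good_test = fit_mse(3)           # matches the true signal
```

A model of the right complexity sits between the two: its training and fresh-data errors are both modest and close together.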
One of the ways to detect overfitting or underfitting is to split your dataset into a training set and a test set. We have already talked about splitting datasets using the scikit-learn library. By splitting the data, we’ll be able to test the accuracy of our model on unseen data right in the development phase. Once we know how it is performing, we can go back and tune the model.
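That tuning loop might look like the following sketch: split once, then sweep a hyperparameter (here a decision tree’s `max_depth`, chosen purely for illustration) and watch test-set accuracy to pick a model that generalises.

```python
# Illustrative tune-against-the-test-set loop with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (1, 3, 5, None):        # None = unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = model.score(X_test, y_test)  # accuracy on unseen data

best_depth = max(scores, key=scores.get)  # depth with the best test accuracy
```

A very shallow tree will tend to underfit and an unlimited one to overfit; the sweep makes that trade-off visible before the model ever meets production data.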
There are a lot of methods for avoiding fitting issues. We’ve already talked about a few of them, such as cross-validation, feature selection, and the train-test split. We’ll talk about a few more in the future.
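Of those, cross-validation deserves a quick sketch: instead of trusting one split, k-fold cross-validation rotates the test set so every sample is held out exactly once, giving a more stable estimate of generalisation. The model and k=5 below are illustrative choices.

```python
# Illustrative 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
mean_acc = scores.mean()
spread = scores.std()  # a large spread suggests the model is split-sensitive
```

The mean gives a single headline number, while the spread across folds hints at how much the model’s performance depends on which data it happened to see.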