Overfitting and Underfitting models in Machine Learning

Sunny Srinidhi · August 2, 2018

In most of our posts about machine learning, we've talked about overfitting and underfitting. But what do those two terms actually mean? What does it mean when a model is overfit, or underfit? Why are they considered bad? And how do they affect the accuracy of our model's predictions? These are some basic but important questions we need to ask and get answers to. So let's discuss these two today.


The datasets we use for training and testing our models play a huge role in how well those models perform. It's equally important to understand the data we're working with. The quantity and the quality of the data also matter, obviously.

When there is too little data in the training phase, the model may fail to pick up the patterns in the data, or fail to learn the correlation of different features with the dependent variable. If the data is not properly divided, we may introduce bias into our models. Now suppose we have quite a good amount of data for training. Once we finish training, we test the model with our test dataset and get an accuracy north of 95%. That's awesome. So, assuming that our model is super ready, we start giving it unseen data. But to our shock, we see accuracy figures of only 50-60%. This is known as overfitting: the model has fit the training data so closely that it fails on data it has never seen.
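
To make that symptom concrete, here's a minimal sketch using scikit-learn. The synthetic dataset from make_classification and the decision tree model are purely illustrative choices, not anything from this post; any model that is too flexible for its data would show the same gap.

    # Illustrative sketch: spotting overfitting by comparing accuracy on
    # the training data against accuracy on held-out (unseen) data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # An unconstrained tree can memorise the training set almost perfectly.
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    print("Train accuracy:", model.score(X_train, y_train))  # typically ~1.0
    print("Test accuracy: ", model.score(X_test, y_test))    # noticeably lower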

In essence, if we see high bias in our predictions, we have an underfitting problem. And if we see high variance, we have an overfitting problem. Overfitting happens because our model fails to generalise to data it has not seen before. We have to make sure we select our features properly so that we can remove any unwanted features which act as noise in the data. If there's too much noise in the data, the model might end up fitting the noise instead of the signal, which would lead to overfitting. Similarly, if there's not enough data, or if the data is too normalised, the model might not be able to detect any kind of pattern or learn from the data. This will lead to underfitting.
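
As a rough illustration of removing noisy features, here's a sketch using scikit-learn's SelectKBest. Again, the synthetic dataset is an assumption made for the example: only 5 of the 20 generated features actually carry signal, and the selector keeps just those.

    # Illustrative sketch: dropping noisy features so the model cannot
    # end up fitting to them.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=5, random_state=42)

    # Keep the 5 features that score highest on the ANOVA F-test.
    selector = SelectKBest(score_func=f_classif, k=5)
    X_reduced = selector.fit_transform(X, y)

    print(X_reduced.shape)  # (500, 5)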

One of the ways to detect overfitting or underfitting is to split your dataset into a training set and a test set. We have already talked about splitting datasets using the SciKit Learn library. By splitting the data, we'll be able to test the accuracy of our model on unseen data right in the development phase. Once we know how it is performing, we can go back and tune the model, as in the sketch below.
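
Here's one way that detect-then-tune loop might look, continuing the decision tree example from earlier. The max_depth values are arbitrary illustrations, not recommendations; the point is simply that constraining the model shrinks the gap between training and test accuracy.

    # Illustrative sketch: split the data, check accuracy on the held-out
    # test set, then tune the model and watch the train/test gap shrink.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Compare an unconstrained tree against a depth-limited one.
    for depth in (None, 3):
        model = DecisionTreeClassifier(max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        print(f"max_depth={depth}: "
              f"train={model.score(X_train, y_train):.2f}, "
              f"test={model.score(X_test, y_test):.2f}")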

There are a lot of methods for avoiding fitting issues. We've already talked about a few of them, such as cross validation, feature selection, and the train-test split. We'll be talking about a few more in the future.
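
And for completeness, this is roughly what cross validation looks like with scikit-learn's cross_val_score; the logistic regression model and synthetic data are stand-ins chosen for this sketch.

    # Illustrative sketch: 5-fold cross validation, so every sample is
    # used as unseen test data exactly once.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=42)

    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("Fold accuracies:", scores)
    print("Mean accuracy:  ", scores.mean())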

