Fit vs. Transform in SciKit libraries for Machine Learning

Data Science

by Sunny Srinidhi - November 7, 2019

We have seen methods such as fit(), transform(), and fit_transform() in a lot of SciKit's libraries. And almost all tutorials, including the ones I've written, only tell you to use one of these methods. The obvious question that arises here is: what do those methods mean? What does it mean to fit something and to transform something? The transform() method makes some sense, since it just transforms the data, but what about fit()? In this post, we'll try to understand the difference between the two. To better understand the meaning of these methods, we'll take the Imputer class as an example, because the Imputer class has these methods. But before we get started, keep in mind that fitting something like an imputer
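To make the difference concrete, here is a minimal sketch. It assumes a recent scikit-learn release, where the old Imputer class has been removed and SimpleImputer (in sklearn.impute) is its replacement; the small arrays are made up purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing value in each column.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")

# fit() only learns the per-column means from the training data.
imputer.fit(X_train)
learned_means = imputer.statistics_  # column means: 2.0 and 15.0

# transform() applies the already-learned means to data with the same columns.
X_new = np.array([[np.nan, np.nan]])
X_new_filled = imputer.transform(X_new)

# fit_transform() performs both steps on the same data in one call.
X_train_filled = SimpleImputer(strategy="mean").fit_transform(X_train)

print(learned_means)
print(X_new_filled)
print(X_train_filled)
```

The point of keeping the two steps separate is that whatever was learned from the training data can be reused, unchanged, on data the imputer has never seen.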

ColumnTransformer in SciKit for LabelEncoding and OneHotEncoding in Machine Learning

Data Science

by Sunny Srinidhi - November 6, 2019

In a very old post - Label Encoder vs. One Hot Encoder in Machine Learning - I had demonstrated how to use label encoding and one hot encoding to separate out categorical text data into numbers and different columns. But the SciKit library has come a long way since I wrote that post, and it has made life a lot easier. The developers of the library might have realised that people use LabelEncoding and OneHotEncoding very frequently. So they decided to come up with a new class called ColumnTransformer, which basically combines LabelEncoding and OneHotEncoding into just one line of code. And the result is exactly the same. In this post, we'll quickly take a look at
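As a rough sketch of what that looks like, the snippet below one-hot encodes a text column with ColumnTransformer and passes the other column through untouched. The country/age data is hypothetical, and sparse_threshold=0 is set here only so that the result comes back as a dense array that is easy to print.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset: column 0 is a country name, column 1 is an age.
X = np.array([["France", 44.0],
              ["Spain", 27.0],
              ["Germany", 30.0],
              ["Spain", 38.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",                           # keep the age column as is
    sparse_threshold=0,                                # always return a dense array
)

# One column per country (sorted alphabetically), followed by the age column.
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)
print(X_encoded)
```

OneHotEncoder accepts string categories directly in current scikit-learn versions, so no separate LabelEncoder step is needed before it.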

Overfitting and Underfitting models in Machine Learning

Data Science

by Sunny Srinidhi - August 2, 2018

In most of our posts about machine learning, we've talked about overfitting and underfitting. But most of us don't yet know what those two terms mean. What does it actually mean when a model is overfit, or underfit? Why are they considered not good? And how do they affect the accuracy of our model's predictions? These are some of the basic but important questions we need to ask and get answers to. So let's discuss these two today. The datasets we use for training and testing our models play a huge role in the efficiency of our models. It's equally important to understand the data we're working with. The quantity and the quality of the data also matter, obviously. When the data
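One way to see the two terms in action is to fit models of increasing flexibility to the same noisy data and compare the error on the training set with the error on held-out data. The sketch below is only an illustration: the synthetic sine-shaped data, the polynomial degrees, and the helper names are all assumptions, and it uses plain NumPy polynomial fitting rather than any particular SciKit model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy samples of a sine-shaped relationship."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

# Degree 1 is likely too simple for this curve, degree 12 is very flexible.
results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

A model that is too simple tends to show a high error on both sets (underfitting), while a very flexible model can drive the training error down and still do worse on data it has not seen (overfitting).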

Different types of Validations in Machine Learning (Cross Validation)

Data Science

by Sunny Srinidhi - August 1, 2018

Now that we know what feature selection is and how to do it, let's move our focus to validating the efficiency of our model. This is known as validation or cross validation, depending on what kind of validation method you're using. But before that, let's try to understand why we need to validate our models. Validation, or Evaluation of Residuals: Once you are done with fitting your model to your training data, and you've also tested it with your test data, you can't just assume that it's going to work well on data that it has not seen before. In other words, you can't be sure that the model will have the desired accuracy and variance in your production environment. You need
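As one possible illustration, k-fold cross validation can be run with scikit-learn's cross_val_score. This is a sketch on synthetic data generated with make_regression; the model, the number of folds, and the dataset parameters are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5-fold cross validation: the model is trained on four folds and scored
# (R^2 for regressors by default) on the remaining fold, five times over.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
mean_score = scores.mean()

print(scores)
print(mean_score)
```

Looking at the spread of the per-fold scores, and not just the mean, gives a hint of how much the model's performance depends on which rows it was trained on.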

Different methods of feature selection

Data Science

by Sunny Srinidhi - July 31, 2018 (updated November 6, 2019)

In our previous post, we discussed what feature selection is and why we need it. In this post, we're going to look at the different methods used in feature selection. There are three main classes of feature selection methods - Filter Methods, Wrapper Methods, and Embedded Methods. We'll look at all of them individually. Filter Methods: Filter methods are learning-algorithm-agnostic, which means they can be employed no matter which learning algorithm you're using. They're generally used as data pre-processors. In filter methods, each individual feature in the dataset will be scored on its correlation with the dependent variable. A variety of statistical tests will be used to calculate this correlation score. Based on this score, it will be decided whether to
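A filter method of this kind can be sketched with scikit-learn's SelectKBest, which scores every feature on its own against the target and keeps the top k. The choice of f_regression as the statistical test, k=2, and the synthetic data are all assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 6 features, of which only 2 actually drive the target.
X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)

# Score each feature individually with a univariate F-test, keep the best 2.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

feature_scores = selector.scores_           # one score per original feature
kept = selector.get_support(indices=True)   # indices of the retained features

print(feature_scores)
print(kept)
```

Because no learning algorithm is involved in the scoring, the same selected columns can be handed to whichever model is trained afterwards.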

Linear Regression in Python using SciKit Learn

Data Science

by Sunny Srinidhi - July 30, 2018

Today we'll be looking at a simple Linear Regression example in Python, and as always, we'll be using the SciKit Learn library. If you haven't yet looked into my posts about data pre-processing, which is required before you can fit a model, check out how you can encode your data to make sure it doesn't contain any text, and then how you can handle missing data in your dataset. After that, you have to make sure all your features are in the same range for the model so that one feature is not dominating the whole output; and for this, you need feature scaling. Finally, split your data into training and testing sets. Once you're done with all that, you're ready to start your
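Put together, the last two steps (splitting and fitting) might look like the sketch below. The single-feature dataset is generated here with a made-up slope and intercept, so it is not the dataset used in the post itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature with a roughly linear relation to the target.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(60, 1))
y = 2.5 * X[:, 0] + 4.0 + rng.normal(0.0, 0.5, size=60)

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training set, then predict and score on the test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = model.score(X_test, y_test)

print(model.coef_, model.intercept_)
print(r2)
```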

Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?

Data Science

by Sunny Srinidhi - July 27, 2018 (updated November 5, 2019)

When you're working with a learning model, it is important to scale the features to a range which is centered around zero. This is done so that the variances of the features are in the same range. If a feature's variance is orders of magnitude more than the variance of other features, that particular feature might dominate other features in the dataset, which is not something we want happening in our model. The aim here is to achieve features with zero mean and unit variance. There are many ways of doing this; the two most popular are standardisation and normalisation. No matter which method you choose, the SciKit Learn library provides a class to easily scale our data. We can use the StandardScaler
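Here is a small sketch of both approaches on a made-up array whose two columns differ by three orders of magnitude. StandardScaler handles standardisation; using MinMaxScaler for normalisation is an assumption on my part about which class fits that role.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

# Standardisation: subtract each column's mean and divide by its standard
# deviation, giving zero mean and unit variance per column.
X_std = StandardScaler().fit_transform(X)

# Normalisation: rescale each column to the [0, 1] range.
X_norm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_norm)
```

After either transformation the two columns end up on a comparable scale, so neither dominates purely because of its units.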

How to split your dataset to train and test datasets using SciKit Learn

Data Science

by Sunny Srinidhi - July 27, 2018 (updated November 5, 2019)

When you're working on a model and want to train it, you obviously have a dataset. But after training, we have to test the model on some test dataset. For this, you'll need a dataset which is different from the training set you used earlier. But it might not always be possible to have so much data during the development phase. In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing; and you do this before you start training your model. But the question is, how do you split the data? You can't possibly manually split the dataset into two. And you also have to make sure you split
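In scikit-learn, this split is usually done with train_test_split. The sketch below uses a tiny made-up dataset; the 80/20 ratio and the random_state value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 rows, 2 features, plus one target value per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the rows and hold out 20% of them for testing.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train, y_test)
```

Every row ends up in exactly one of the two sets, and features and targets are shuffled together so they stay aligned.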

Handle missing data in your training dataset with SciKit Imputer

Data Science

by Sunny Srinidhi - July 27, 2018 (updated November 5, 2019)

More often than not, you'll encounter a dataset in your data science projects where you'll have missing data in at least one column. In some cases, you can just ignore that row by taking it out of the dataset. But that won't always be the case. Sometimes, that row would be crucial for the training, maybe because the dataset itself is very small and you can't afford to lose any row, or maybe it holds some important data, or for some other reason. When this is the case, a very important question to answer is, how do you fill in the blanks? There are many approaches to solving this problem, and one of them is using SciKit's Imputer class. If you're

Label Encoder vs. One Hot Encoder in Machine Learning

by Sunny Srinidhi - July 27, 2018 (updated November 6, 2019)

Update: SciKit has a new class called ColumnTransformer, which has replaced LabelEncoding. You can check out this updated post about ColumnTransformer to know more. If you're new to Machine Learning, you might get confused between these two - Label Encoder and One Hot Encoder. These two encoders are parts of the SciKit Learn library in Python, and they are used to convert categorical data, or text data, into numbers, which our predictive models can better understand. Today, let's understand the difference between the two with a simple example. Label Encoding: To begin with, you can find the SciKit Learn documentation for Label Encoder here. Now, let's consider the following data: In this example, the first column is the country column, which is all
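To see the difference side by side, here is a minimal sketch that runs both encoders on a made-up country column (the values are hypothetical and not the data from the post).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical country column.
countries = np.array(["Japan", "USA", "India", "USA"])

# Label encoding: each category becomes one integer. Classes are sorted
# alphabetically, so India=0, Japan=1, USA=2.
labels = LabelEncoder().fit_transform(countries)

# One hot encoding: each category becomes its own 0/1 column, so no
# artificial ordering between countries is implied.
one_hot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()

print(labels)
print(one_hot)
```

The integers produced by label encoding suggest an order (USA > Japan > India) that does not exist in the data, which is the usual reason to prefer one hot encoding for this kind of column.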