How to split your dataset to train and test datasets using SciKit Learn
Data Science | by Sunny Srinidhi | July 27, 2018 | November 5, 2019

When you're working on a model and want to train it, you obviously have a dataset. But after training, you have to test the model on some test dataset. For this, you'll need a dataset that is different from the training set you used earlier. But it might not always be possible to have that much data during the development phase. In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing, and you do this before you start training your model. But the question is, how do you split the data? You can't possibly split the dataset into two manually. And you also have to make sure you split
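
The excerpt above stops before it shows the split itself, so here is a minimal sketch of what it could look like with `train_test_split` from `sklearn.model_selection`. The toy arrays, the 80/20 ratio, and the `random_state` value are assumptions made for illustration and are not taken from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 10 rows, 2 feature columns, 1 target value per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing. The rows are shuffled before the
# split, and random_state makes the shuffle repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Each row lands in exactly one of the two sets, so the model is never tested on a row it was trained on.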
Handle missing data in your training dataset with SciKit Imputer
Data Science | by Sunny Srinidhi | July 27, 2018 | November 5, 2019

More often than not, you'll encounter a dataset in your data science projects that has missing data in at least one column. In some cases, you can just ignore that row by taking it out of the dataset. But that won't always be the case. Sometimes, that row will be crucial for the training, maybe because the dataset itself is very small and you can't afford to lose any row, or maybe it holds some important data, or for some other reason. When this is the case, a very important question to answer is, how do you fill in the blanks? There are many approaches to solving this problem, and one of them is using SciKit's Imputer class. If you're
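
To make the idea of filling in the blanks concrete, here is a minimal sketch. It uses `SimpleImputer` from `sklearn.impute`, which took over from the older `Imputer` class named in the post in newer SciKit Learn releases. The sample values and the choice of the `mean` strategy are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value (np.nan) in the first column.
data = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, 6.0],
])

# Replace every missing value with the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)

print(filled)
```

Other strategies such as `median` and `most_frequent` are also available, and which one fits best depends on the column you're filling.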
Label Encoder vs. One Hot Encoder in Machine Learning
Data Science, Tech | by Sunny Srinidhi | July 27, 2018 | November 6, 2019

Update: SciKit Learn has a new class called ColumnTransformer, which has replaced label encoding in this workflow. You can check out this updated post about ColumnTransformer to know more.

If you're new to Machine Learning, you might get confused between these two: Label Encoder and One Hot Encoder. These two encoders are part of the SciKit Learn library in Python, and they are used to convert categorical data, or text data, into numbers, which our predictive models can better understand. Today, let's understand the difference between the two with a simple example.

Label Encoding

To begin with, you can find the SciKit Learn documentation for Label Encoder here. Now, let's consider the following data: In this example, the first column is the country column, which is all
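
The sample data from the post is not part of this excerpt, so here is a minimal sketch of the difference between the two encoders on a made-up country column. The country names are hypothetical and are not the values used in the original post.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical country column (assumed values, not the post's data).
countries = np.array(["Japan", "US", "India", "Japan"])

# Label encoding: each category becomes one integer. The classes are sorted
# alphabetically, so India -> 0, Japan -> 1, US -> 2.
labels = LabelEncoder().fit_transform(countries)

# One hot encoding: each category becomes its own 0/1 column.
# OneHotEncoder expects a 2D input, hence the reshape to a single column.
one_hot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()

print(labels)
print(one_hot)
```

With label encoding, a model may read the integers as an ordering (US greater than Japan greater than India), which has no meaning for countries. One hot encoding avoids that at the cost of one extra column per category.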