Data Science, Tech

Connect Apache Spark to your MongoDB database using the mongo-spark-connector

A couple of days back, we saw how to connect Apache Spark to an Apache HBase database and query data from a table using a catalog. Today, we'll see how to connect Apache Spark to a MongoDB database and pull data directly into Spark from there. MongoDB provides a plugin called the mongo-spark-connector, which helps us connect MongoDB and Spark without any drama at all. We just need to provide the MongoDB connection URI in the SparkConf object and create a ReadConfig object specifying the collection name. It might sound complicated right now, but once you look at the code, you'll understand how easy it is. So, let's look at an example. (Image source: mongodb.com) The Dataset: Before we look at the code, we need to make sure we have some data in our ...
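
The SparkConf URI and ReadConfig mentioned above come from the connector's Scala/Java API. As a rough PySpark equivalent, here's a minimal sketch, assuming a local MongoDB at `mongodb://127.0.0.1`, a hypothetical `people.contacts` collection, and the mongo-spark-connector jar already on the classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical connection details -- replace with your own URI, database and collection.
spark = (
    SparkSession.builder
    .appName("mongo-spark-example")
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.contacts")
    .getOrCreate()
)

# Read the collection straight into a Spark DataFrame via the mongo-spark-connector.
df = (
    spark.read
    .format("com.mongodb.spark.sql.DefaultSource")
    .load()
)

df.printSchema()
df.show(5)
```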

Read More
Data Science, Tech

Connect Apache Spark to your HBase database (Spark-HBase Connector)

There will be times when you'll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you'll query the database, get the data in whatever format you fancy, and then load it into Spark, maybe using the `parallelize()` function. This works just fine. But depending on the size of the data, this could cause delays. At least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in the Hortonworks repository. Now, what is this connector and why should you consider it? (Image source: https://spark.apache.org/) The Spark-HBase Connector (shc-core) The SHC is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pick up the data directly fro...
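
The SHC reads HBase tables through a JSON catalog that maps HBase column families and qualifiers to Spark SQL columns. A minimal PySpark sketch, assuming the shc-core jar is on the classpath and a hypothetical HBase table named `Contacts` with a `cf` column family:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-example").getOrCreate()

# Hypothetical catalog: maps the HBase table "Contacts" to a Spark SQL schema.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "Contacts"},
    "rowkey": "key",
    "columns": {
        "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
        "name":  {"cf": "cf",     "col": "name",  "type": "string"},
        "email": {"cf": "cf",     "col": "email", "type": "string"},
    },
})

# Ask Spark to load the HBase table directly through the SHC data source.
df = (
    spark.read
    .options(catalog=catalog)
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
)

df.show(5)
```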

Read More
Data Science

What is multicollinearity?

(Image from StatisticsHowTo) Multicollinearity is a term we often come across when we're working with multiple regression models. We've even talked about it in our previous posts, but do we know what it actually means? Today, we'll try to understand that. In most real-life problems, we usually have multiple features to work with. And not all of them are in the format that we, or the model, want. For example, a lot of categorical features come as text. But as we already know, our models require the features to be numerical. For this, we label encode the feature and, if required, we even one-hot encode it. But in some cases, we might have features whose values can be easily determined by the values of other features. In other words, we can see a very go...
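
To make "features whose values can be determined by other features" concrete, a pairwise correlation matrix is one quick way to spot the redundancy. A minimal sketch with made-up data, where one column is essentially a unit conversion of another:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data: height_cm and height_in carry the same information,
# so they will be almost perfectly correlated (multicollinearity).
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=200),
    "weight_kg": 0.5 * height_cm + rng.normal(0, 5, size=200),
})

# A pairwise correlation close to 1 (or -1) between two *features*
# is a red flag for multicollinearity.
print(df.corr().round(2))
```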

Read More
Data Science

Overfitting and Underfitting models in Machine Learning

In most of our posts about machine learning, we've talked about overfitting and underfitting. But most of us don't yet know what those two terms mean. What does it actually mean when a model is overfit, or underfit? Why are they considered bad? And how do they affect the accuracy of our model's predictions? These are some basic but important questions we need to ask and get answers to. So let's discuss these two today. The datasets we use for training and testing our models play a huge role in how well our models perform. It's equally important to understand the data we're working with. The quantity and the quality of the data also matter, obviously. When there is too little data in the training phase, the model may fail to understand the patterns in the data, or fa...
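
One simple way to see both problems in practice is to compare training and test scores: a large gap between them suggests overfitting, while poor scores on both suggest underfitting. A minimal sketch with made-up data and decision trees of different depths (my choice of model, just for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A depth-1 tree tends to underfit, a fully grown tree tends to overfit.
for depth in (1, 3, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2={model.score(X_train, y_train):.2f}, "
          f"test R^2={model.score(X_test, y_test):.2f}")
```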

Read More
Data Science

Different types of Validations in Machine Learning (Cross Validation)

Now that we know what feature selection is and how to do it, let's move our focus to validating the efficiency of our model. This is known as validation or cross validation, depending on what kind of validation method you're using. But before that, let's try to understand why we need to validate our models. Validation, or Evaluation of Residuals: Once you're done fitting your model to your training data, and you've also tested it with your test data, you can't just assume that it's going to work well on data it has not seen before. In other words, you can't be sure that the model will have the desired accuracy and variance in your production environment. You need some kind of assurance of the accuracy of the predictions that your model is putting out. For this, we need to val...
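
As a concrete picture of what validating a model looks like in code, scikit-learn's `cross_val_score` runs k-fold cross validation for you. A minimal sketch on a built-in dataset (my choice of dataset and model, not necessarily the one the post goes on to use):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: train on 4 folds, validate on the held-out fold, repeat.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```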

Read More
Data Science

Different methods of feature selection

In our previous post, we discussed what feature selection is and why we need it. In this post, we're going to look at the different methods used in feature selection. There are three main classes of feature selection methods: Filter Methods, Wrapper Methods, and Embedded Methods. We'll look at each of them individually. Filter Methods: Filter methods are learning-algorithm-agnostic, which means they can be employed no matter which learning algorithm you're using. They're generally used as data pre-processors. In filter methods, each individual feature in the dataset is scored on its correlation with the dependent variable. A variety of statistical tests can be used to calculate this correlation score. Based on this score, it is decided whether to retai...
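
For the filter-method idea of scoring each feature against the dependent variable, scikit-learn's `SelectKBest` is one convenient implementation. A minimal sketch (my choice of scoring function and dataset, just to show the mechanics):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature against the target with an ANOVA F-test
# and keep only the 2 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("scores per feature:  ", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:       ", X_selected.shape)
```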

Read More
Data Science

What is Feature Selection and why do we need it in Machine Learning?

If you've come across a dataset in your machine learning endeavors that has more than one feature, you've probably also heard of a concept called Feature Selection. Today, we're going to find out what it is and why we need it. When a dataset has too many features, it would not be ideal to include all of them in our machine learning model. Some features may be irrelevant to the dependent variable. For example, if you are going to predict how much it would cost to crush a car, and the features you're given are: the dimensions of the car; whether the car will be delivered to the crusher or the company has to go pick it up; whether the car has any fuel in the tank; and the color of the car, you can kind of assume that the color of the car is not going to influence the cost of crushing it,...

Read More
Data Science

Linear Regression in Python using SciKit Learn

Today we'll be looking at a simple Linear Regression example in Python, and as always, we'll be using the SciKit Learn library. If you haven't yet looked into my posts about data pre-processing, which is required before you can fit a model, check out how you can encode your data to make sure it doesn't contain any text, and then how you can handle missing data in your dataset. After that, you have to make sure all your features are in the same range so that no single feature dominates the output; for this, you need feature scaling. Finally, split your data into training and testing sets. Once you're done with all that, you're ready to build your first and simplest machine learning model, Linear Regression. For this example, we're going to use a diff
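
The excerpt cuts off before the dataset is introduced, but the scikit-learn part generally boils down to a fit/predict pair. A minimal sketch with made-up data (the post uses its own dataset, which isn't shown here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: y is roughly 3x + 4 plus some noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 4 + rng.normal(0, 1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)  # learn the coefficients from the training set

print("coefficient:", model.coef_.round(2), "intercept:", round(model.intercept_, 2))
print("test R^2:", round(model.score(X_test, y_test), 3))
```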

Read More
Data Science

Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?

When you're working with a learning model, it is important to scale the features to a range centered around zero. This is done so that the variances of the features are in the same range. If a feature's variance is orders of magnitude larger than the variance of the other features, that particular feature might dominate the others in the dataset, which is not something we want happening in our model. The aim here is to achieve a Gaussian distribution with zero mean and unit variance. There are many ways of doing this, the two most popular being standardisation and normalisation. No matter which method you choose, the SciKit Learn library provides a class to easily scale our data. We can use the StandardScaler class from the library for this. Now that we know why we need to scale our features, le
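
A minimal sketch of the StandardScaler usage the excerpt refers to, with a made-up feature matrix; the scaler is fitted on the training data only and the same transform is then applied to the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. age in years, income in dollars).
X_train = np.array([[25, 40_000.0], [32, 65_000.0], [47, 120_000.0], [51, 90_000.0]])
X_test = np.array([[29, 52_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std on test data

print("means after scaling:   ", X_train_scaled.mean(axis=0).round(2))  # ~0
print("std devs after scaling:", X_train_scaled.std(axis=0).round(2))   # ~1
```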

Read More
Data Science

How to split your dataset to train and test datasets using SciKit Learn

When you're working on a model and want to train it, you obviously have a dataset. But after training, we have to test the model on some test dataset. For this, you'll need a dataset different from the training set you used earlier. But it might not always be possible to have that much data during the development phase. In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing; and you do this before you start training your model. But the question is, how do you split the data? You can't possibly split the dataset into two manually. And you also have to make sure you split the data in a random manner. To help us with this task, the SciKit Learn library provides a tool, called the Model Selection library. There's a cla...
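
The excerpt is cut off, but the usual tool for this kind of random split in scikit-learn's model_selection module is `train_test_split`. A minimal sketch with a made-up dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Randomly hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("train rows:", X_train.shape[0], "test rows:", X_test.shape[0])
```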

Read More