Tag Archives: datascience

Apache Spark SQL User Defined Function (UDF) POC in Java

May 14, 2019

If you’ve worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it’s a feature that lets you define your own function, which is pretty straightforward. But how is this different from any other custom function you write? Read more...

Connect Apache Spark to your HBase database (Spark-HBase Connector)

April 1, 2019

There will be times when you’ll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you’d query the database, get the data in whatever format you fancy, and then load it into Spark, perhaps using the `parallelize()` function. Read more...

Overfitting and Underfitting models in Machine Learning

August 2, 2018


In most of our posts about machine learning, we’ve talked about overfitting and underfitting. But many of us don’t yet know what those two terms mean. What does it actually mean when a model is overfit, or underfit? Why are they considered bad? Read more...
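
As a rough sketch of the idea (synthetic data and variable names are my own, not from the post), fitting polynomials of very different degrees to the same noisy data shows the two failure modes: a low-degree model scores poorly even on its own training data (underfit), while a very high-degree model scores almost perfectly on training data it has memorized (overfit):

```python
# Illustrative sketch: underfitting vs. overfitting on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]          # 40 points in [0, 1]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 40)  # sine wave + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 1 is too simple for a sine wave (underfit);
# degree 15 can bend through nearly every training point (overfit).
under = make_pipeline(PolynomialFeatures(1), LinearRegression()).fit(X_train, y_train)
over = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X_train, y_train)

print("underfit train R^2:", under.score(X_train, y_train))
print("overfit  train R^2:", over.score(X_train, y_train))
print("overfit  test  R^2:", over.score(X_test, y_test))
```

The telltale sign of overfitting is the gap: the degree-15 model looks great on the training set but does worse on data it hasn’t seen.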

Different types of Validations in Machine Learning (Cross Validation)

August 1, 2018


Now that we know what feature selection is and how to do it, let’s move our focus to validating the efficiency of our model. This is known as validation or cross validation, depending on which validation method you’re using. Read more...
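
As a minimal sketch of k-fold cross validation in SciKit Learn (the dataset and classifier here are my own choices for illustration), the data is split into k folds, and the model is trained and scored k times, each time holding out a different fold as the validation set:

```python
# Sketch: 5-fold cross validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv=5 trains/evaluates the model 5 times, once per held-out fold.
scores = cross_val_score(clf, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Averaging the per-fold scores gives a more reliable estimate of model performance than a single train/test split.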

Different methods of feature selection

July 31, 2018


In our previous post, we discussed what feature selection is and why we need it. In this post, we’re going to look at the different methods used in feature selection. There are three main classes of feature selection methods – Filter Methods, Wrapper Methods, and Embedded Methods. Read more...
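
As one concrete example of a filter method (the dataset and `k` value here are my own, chosen for illustration), SciKit Learn’s `SelectKBest` scores each feature against the target with a statistical test and keeps only the top-scoring ones:

```python
# Sketch of a filter method: keep the k best features by ANOVA F-test score.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features

selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)       # keep only the 2 best features

print("before:", X.shape, "after:", X_new.shape)
```

Filter methods like this score features independently of any model; wrapper methods (e.g. recursive feature elimination) instead repeatedly train a model to judge feature subsets.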

What is Feature Selection and why do we need it in Machine Learning?

July 31, 2018


If you’ve come across a dataset in your machine learning endeavors that has more than one feature, you’ve probably also heard of a concept called Feature Selection. Today, we’re going to find out what it is and why we need it. Read more...

Linear Regression in Python using SciKit Learn

July 30, 2018

Today we’ll be looking at a simple Linear Regression example in Python, and as always, we’ll be using the SciKit Learn library. If you haven’t yet looked into my posts about data pre-processing, which is required before you can fit a model, check out how you can encode your data to make sure it doesn’t contain any text, and then how you can handle missing data in your dataset. Read more...
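
In its simplest form (the tiny made-up dataset below is mine, not from the post), fitting a linear regression in SciKit Learn is a fit-then-predict affair:

```python
# Minimal linear regression sketch: the data follows y = 2x + 1 exactly.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # one feature, as a column
y = np.array([3, 5, 7, 9, 11])            # y = 2x + 1

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

pred = model.predict([[6]])               # predict y for x = 6
print("prediction for x=6:", pred[0])
```

Because the data is perfectly linear, the learned slope and intercept recover 2 and 1 exactly.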

Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?

July 27, 2018

When you’re working with a learning model, it is important to scale the features to a range centered around zero. This is done so that the variances of the features are in the same range. If a feature’s variance is orders of magnitude greater than the variances of the other features, that feature might dominate the others in the dataset, which is not something we want happening in our model. Read more...
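
A quick sketch of this with SciKit Learn’s `StandardScaler` (the two-feature toy data is my own, with the second feature deliberately 100× the first): each feature is rescaled to zero mean and unit variance, so neither dominates:

```python
# Sketch: standardize features to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Second feature is two orders of magnitude larger than the first.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaled = StandardScaler().fit_transform(X)
print("means:", scaled.mean(axis=0))   # ~0 per feature
print("stds: ", scaled.std(axis=0))    # 1 per feature
```

After scaling, both columns sit on the same footing, which is exactly what distance- and gradient-based models need.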

How to split your dataset to train and test datasets using SciKit Learn

July 27, 2018

When you’re working on a model and want to train it, you obviously have a dataset. But after training, we have to test the model on a test dataset. For this, you’ll need a dataset which is different from the training set you used earlier. Read more...
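
The one-liner for this in SciKit Learn is `train_test_split` (the toy data and the 80/20 split ratio below are my own choices for illustration):

```python
# Sketch: split a dataset into an 80% training set and a 20% test set.
from sklearn.model_selection import train_test_split

X = list(range(10))          # 10 samples
y = [v * 2 for v in X]       # matching labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random_state makes the split reproducible

print("train size:", len(X_train), "test size:", len(X_test))
```

The split is shuffled by default, so fixing `random_state` is what makes your experiments repeatable.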