Data Science

Backward Elimination for Feature Selection in Machine Learning

When we're building a machine learning model, it is very important that we select only those features or predictors that are actually necessary. Suppose we have 100 features or predictors in our dataset. That doesn't mean we need all 100 of them in our model, because not all of them will have a significant influence on the model. Then again, this isn't true in every case; it depends entirely on the data we have in hand. Here is more info about why we need feature selection. There are various ways to find out which features have very little impact on the model and which ones you can remove from your dataset. I have written about feature selection before, but only very briefly. In this post, we'll look at Backward Eliminati...
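The gist of the technique: fit a model, find the predictor with the highest p-value, drop it if it exceeds a chosen significance level, and repeat. Here's a minimal sketch of that loop using statsmodels' OLS (a common way to implement it, not necessarily the exact code from the post); it assumes a numeric NumPy feature matrix, and the names are mine.

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Repeatedly drop the predictor with the highest p-value."""
    X = sm.add_constant(X)          # prepend the intercept column
    cols = list(range(X.shape[1]))  # indices of surviving columns
    while len(cols) > 1:
        model = sm.OLS(y, X[:, cols]).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] <= significance_level:
            break                   # every remaining predictor is significant
        cols.pop(worst)             # eliminate the weakest predictor
    return cols, sm.OLS(y, X[:, cols]).fit()
```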

Read More
Data Science

Null Hypothesis and the P-Value

When you're starting your machine learning journey, you'll come across the null hypothesis and the p-value. At a certain point in your journey, it becomes quite important to know what these mean so you can make meaningful decisions while designing your machine learning models. So in this post, I'll try to explain what these two terms mean. Now, if you don't have a background in statistics, the textbook definitions of the null hypothesis and the p-value will make no sense to you. They're just gibberish going way over your head. That's what happened to me the first few times I tried to understand them. It took me a good couple of days to get an idea of what they mean. I could still be wrong in my understanding to this very day, and I'm sure that you guys will have more knowledge about t...
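To make this concrete before the full explanation, here's a toy example with SciPy (my own made-up numbers, not from the post). The null hypothesis is that a population mean equals 50, and we test it against a small sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=53.0, scale=10.0, size=30)  # made-up measurements

# Null hypothesis H0: the true mean is 50.
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

# The p-value is the probability of seeing data at least this extreme
# if H0 were true. Below a chosen threshold (commonly 0.05), we reject H0.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```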

Read More
Data Science

Fit vs. Transform in SciKit libraries for Machine Learning

We have seen methods such as fit(), transform(), and fit_transform() in a lot of SciKit's classes. And almost all tutorials, including the ones I've written, just tell you to use one of these methods. The obvious question that arises here is: what do those methods actually do? What does it mean to fit something and transform something? The transform() method makes some sense, it just transforms the data, but what about fit()? In this post, we'll try to understand the difference between the two. To better understand the meaning of these methods, we'll take the Imputer class as an example, because the Imputer class has these methods. But before we get started, keep in mind that fitting something like an imputer is different from fitting a whole model. You use an Imputer to handle missi...
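As a quick preview of the distinction (note that the Imputer class the post uses has since been renamed; in current scikit-learn it lives at sklearn.impute.SimpleImputer): fit() learns the statistics from the data, transform() applies them. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)               # fit: learn the column mean from the training data
print(imputer.statistics_)         # -> [2.333...], the learned mean

print(imputer.transform(X_train))  # transform: fill NaNs with the learned mean
print(imputer.transform(X_test))   # the *training* mean is reused on new data
```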

Read More
Data Science

ColumnTransformer in SciKit for LabelEncoding and OneHotEncoding in Machine Learning

In a very old post - Label Encoder vs. One Hot Encoder in Machine Learning - I demonstrated how to use label encoding and one hot encoding to convert categorical text data into numbers and separate columns. But the SciKit library has come a long way since I wrote that post, and it has made life a lot easier. The developers of the library might have realised that people use LabelEncoding and OneHotEncoding very frequently. So they came up with a new class called ColumnTransformer, which basically combines LabelEncoding and OneHotEncoding into just one line of code. And the result is exactly the same. In this post, we'll quickly take a look at how we can do that with some code snippets. The Code First, as usual, we need to import the required li...
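For a taste of what that one-liner looks like, here's a minimal sketch (the column index and data are mine, not the post's snippet): one-hot encode the first column and pass the rest through untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([["France", 44, 72000],
              ["Spain", 27, 48000],
              ["Germany", 30, 54000]], dtype=object)

# One-hot encode column 0; keep the remaining columns as they are.
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X)
print(X_encoded)
```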

Read More
Data Science, Tech

Apache Kafka Streams and Tables, the stream-table duality

In the previous post, we tried to understand the basics of Apache's Kafka Streams. In this post, we'll build on that knowledge and see how Kafka Streams can be used both as streams and as tables. Stream processing has become very common in most modern applications today. You'll have at least one stream coming into your system to be processed. And depending on your application, that processing will mostly be stateless. But that's not the case for all applications; sometimes there's data enrichment going on between streams. Suppose you have one stream of user activity coming in. Ideally, you'll have a user ID attached to each fact in that stream. But down the pipeline, a user ID alone is not going to be enough for processing. Maybe you need more information about the user to be present in t...
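The stream-table duality the post builds on can be sketched without any Kafka code at all: a table is just the latest value per key folded from a changelog stream, and enrichment is a lookup against that state. A tiny plain-Python illustration (names and records are made up):

```python
# A "table" is the latest value per key, folded from a changelog stream.
user_updates = [("u1", {"name": "Alice"}), ("u2", {"name": "Bob"}),
                ("u1", {"name": "Alice B."})]  # later facts overwrite earlier ones

user_table = {}
for key, value in user_updates:
    user_table[key] = value        # table = materialised view of the stream

# Enriching an activity stream with the table (a stream-table join).
activity_stream = [("u1", "login"), ("u2", "click"), ("u1", "logout")]
for user_id, action in activity_stream:
    user = user_table.get(user_id, {})
    print(f"{user.get('name', '?')} did {action}")
```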

Read More
Data Science, Tech

Put data to Amazon Kinesis Firehose delivery stream using Spring Boot

If you work with streams of big data that have to be collected, transformed, and analysed, you have surely heard of Amazon Kinesis Firehose. It is an AWS service used to load streams of data into data lakes or analytical tools, while compressing, transforming, or encrypting the data along the way. You can use Firehose to load streaming data into something like S3 or Redshift. From there, you can use a SQL query engine such as Amazon Athena to query this data. You can even connect this data to your BI tool and get real-time analytics. This can be very useful in applications where real-time analysis of data is necessary. In this post, we'll see how we can create a delivery stream in Kinesis Firehose, and write a simple piece of Java code to put records (produce data) to t...
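The post does this with the AWS Java SDK; as a language-agnostic preview of the same API call, here's the equivalent with boto3 (the Python AWS SDK). The stream name is a placeholder, and the delivery stream must already exist.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

record = {"userId": 42, "event": "page_view"}  # made-up payload

# Firehose expects the record data as bytes; a trailing newline keeps
# records separated once they land in S3.
response = firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
print(response["RecordId"])
```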

Read More
Data Science, Tech

How to Query Athena from a Spring Boot application?

In the last post, we saw how to query data from S3 using Amazon Athena in the AWS Console. But querying from the Console itself is very limited. We can't really do much with the data, and whenever we want to analyse it, we can't sit in front of the Console all day running queries manually. We need to automate the process. And what better way to do that than writing a piece of code? So in this post, we'll see how we can use the AWS Java SDK in a Spring Boot application to query the same sample data set from the previous post. We'll then log the results to the console to make sure we're getting the right data. The Dependencies Before we get to the code, let's first get our dependencies right. I did the painstaking task of finding the right dependencies for this POC. All...
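The post itself uses the AWS Java SDK; to sketch the flow compactly, here's the same sequence of calls with boto3: start the query, poll until it finishes, then fetch the results. The database, table, and bucket names are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off the query; Athena writes results to the given S3 location.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM sampledata LIMIT 10",    # placeholder table
    QueryExecutionContext={"Database": "sampledb"},     # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```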

Read More
Data Science, Tech

Query data from S3 files using Amazon Athena

Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3. It is very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena is limited to data in Amazon's own S3 storage service. However, Athena is able to query a variety of file formats, including, but not limited to, CSV, Parquet, and JSON. In this post, we'll see how we can set up a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need that sample CSV file. You can download it here: sampleDataDownload Once you have the file downloaded, create a new bucket in ...
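The table setup itself boils down to a single DDL statement run in Athena. Here's a hedged sketch of what that looks like for a CSV file; the bucket path and column names are placeholders, since I'm only assuming the shape of the sample data.

```python
# DDL for an Athena table over CSV; paste this into the Athena query editor.
# Bucket path and columns are placeholders for the sample data set.
create_table_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.sampledata (
    id INT,
    name STRING,
    city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-sample-bucket/sampledata/'
TBLPROPERTIES ('skip.header.line.count' = '1')
""".strip()

print(create_table_ddl)
```

Note that LOCATION points at the folder holding the file, not the file itself, and the skip.header.line.count property keeps the CSV header row out of the results.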

Read More
Data Science, Tech

Apache Drill vs. Apache Spark – Which SQL query engine is better for you?

If you are in the big data, data science, or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a small fraction of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. The first thing I wondered when I glanced over the capabilities of Apache Drill was: how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself, and maybe for you guys too. It is very important to understand that there is a fundamental difference between the two: how they are implemented and what they are capable of. With Apache Drill, we write SQL quer...
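To preview the difference in working style: with Drill you point plain SQL directly at a file, while with Spark you stand up a session and build the query in code. A small PySpark sketch of the Spark side (the file path is a placeholder):

```python
from pyspark.sql import SparkSession

# Spark is code-first: create a session, load the data into a DataFrame,
# and only then run SQL against it.
spark = SparkSession.builder.appName("drill-vs-spark").getOrCreate()

df = spark.read.csv("data/users.csv", header=True, inferSchema=True)  # placeholder path
df.createOrReplaceTempView("users")

spark.sql("SELECT city, COUNT(*) AS n FROM users GROUP BY city").show()

# In Drill, the equivalent is a single query against the file itself:
#   SELECT city, COUNT(*) FROM dfs.`data/users.csv` GROUP BY city;
```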

Read More
Data Science, Tech

Getting Started with Apache Drill and MongoDB

Not a lot of people have heard of Apache Drill. That is because Drill caters to very specific use cases; it's very niche. But when it is used, it can make a significant difference to the way you interact with data. First, let's see what Apache Drill is, and then how we can connect our MongoDB data source to Drill and easily query data. What is Apache Drill? According to their website, Apache Drill is a "Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage." That's pretty much self-explanatory. So, Drill is a tool to query Hadoop, MongoDB, and other NoSQL databases. You can write simple SQL queries that run on the data stored in these databases, and you get the result in a row-column format. The best part is that you can even query Apache Kafka and AWS S3 data with this. ...
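Once the mongo storage plugin is enabled, Drill exposes collections as mongo.<database>.<collection>, and you can submit queries over Drill's REST API. A hedged sketch using Python's requests against a local Drill instance (the database and collection names are placeholders):

```python
import requests

# Drill's REST endpoint for submitting queries (default embedded port 8047).
DRILL_URL = "http://localhost:8047/query.json"

# With the mongo storage plugin enabled, collections are addressed as
# mongo.<database>.<collection>.
query = {
    "queryType": "SQL",
    "query": "SELECT name, city FROM mongo.mydb.users LIMIT 5",  # placeholder names
}

response = requests.post(DRILL_URL, json=query, timeout=30)
response.raise_for_status()
for row in response.json()["rows"]:
    print(row)
```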

Read More