ColumnTransformer in SciKit for LabelEncoding and OneHotEncoding in Machine Learning

Data Science

by Sunny Srinidhi - November 6, 2019November 6, 20193

In a very old post - Label Encoder vs. One Hot Encoder in Machine Learning - I had demonstrated how to use label encoding and one hot encoding to separate out categorical text data into numbers and different columns. But the SciKit library has come a long way since I wrote that post, and it has made life a lot more easier. The developers of the library might have realised that people use LabelEncoding and OneHotEncoding very frequently. So they decided to come up with a new library called the ColumnTransformer, which will basically combine LabelEncoding and OneHotEncoding into just one line of code. And the result is exactly the same. In this post, we'll quickly take a look at

Invoke an AWS Lambda Function from another Lambda Function

by Sunny Srinidhi - November 4, 2019November 4, 20190

I recently discovered that you can't invoke more than one Lambda function in AWS for an S3 event, with the same prefix and suffix (or just with the same suffix, which was the issue in my case). So I wanted a way to invoke one Lambda function from another Lambda function. If you're feeling kind of lost, check out the problem statement in my Github project. That could possibly add some context to the problem. If you don't want to go there, I'll try to explain it here again. The Problem and the Requirement In one of our projects, we have a Lambda function which is invoked whenever a text file is uploaded to a particular S3 bucket. The Lambda function takes

Apache Kafka Streams and Tables, the stream-table duality

by Sunny Srinidhi - October 1, 2019February 25, 20200

In the previous post, we tried to understand the basics of Apache's Kafka Streams. In this post, we'll build on that knowledge and see how Kafka Streams can be used both as streams and tables. Stream processing has become very common in most modern applications today. You'll have a minimum of one stream coming into your system to be processed. And depending on your application, it'll mostly be stateless. But that's not the case with all applications. We'll have some sort of data enrichment going on in between streams. Suppose you have one stream of user activity coming in. You'll ideally have a user ID attached to each fact in that stream. But down the pipeline, user ID is

Getting started with Apache Kafka Streams

by Sunny Srinidhi - September 30, 2019March 12, 20201

In the age of big data and data science, stream processing is very significant. So it's not at all surprising that every major organisation has at least one stream processing service. Apache has a few too, but today we're going to look at Apache's Kafka Streams. Kafka is a very popular pub-sub service. And if you've worked with Kafka before, Kafka Streams is going to be very easy to understand. And if you haven't got any idea of Kafka, you don't have to worry, because most of the underlying technology has been abstracted in Kafka Streams so that you don't have to deal with consumers, producers, partitions, offsets, and the such. In this post, we'll look that a few concepts of

Put data to Amazon Kinesis Firehose delivery stream using Spring Boot

by Sunny Srinidhi - September 26, 2019February 12, 20201

If you work with streams of big data which have to be collected, transformed, and analysed, you for sure would have heard of Amazon Kinesis Firehose. It is an AWS service used to load streams of data to data lakes or analytical tools, along with compressing, transforming, or encrypting the data. You can use Firehose to load streaming data to something like S3, or RedShift. From there, you can use a SQL query engine such as Amazon Athena to query this data. You can even connect this data to your BI tool and get real time analytics of the data. This could be very useful in applications where real time analysis of data is necessary. In this post, we'll see

How to Query Athena from a Spring Boot application?

by Sunny Srinidhi - September 25, 2019March 3, 20203

In the last post, we saw how to query data from S3 using Amazon Athena in the AWS Console. But querying from the Console itself if very limited. We can't really do much with the data, and anytime we want to analyse this data, we can't really sit in front of the console the whole day and run queries manually. We need to automate the process. And what better way to do that than writing a piece of code? So in this post, we'll see how we can use the AWS Java SDK in a Spring Boot application and query the same sample data set from the previous post. We'll then log it to the console to make sure we're

Query data from S3 files using Amazon Athena

by Sunny Srinidhi - September 24, 2019March 7, 20201

Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3. This is very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena is limited to data only from Amazon's own S3 storage service. However, Athena is able to query a variety of file formats, including, but not limited to CSV, Parquet, JSON, etc. In this post, we'll see how we can setup a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need

Use Apache Drill with Spring Boot or Java to query data using SQL queries

by Sunny Srinidhi - September 24, 2019April 7, 20200

In the last few posts, we saw how to connect Apache Drill with MongoDB and also how we can connect it to Kafka to query data using simple SQL queries. But when you want to move this to an actual real world project, you can't sit around querying data from a terminal all day long. You want to write a piece of code which does the dirty work for you. But how exactly do you use Apache Drill within your code? Today, we'll see how we can achieve this with Spring Boot, or pretty much any other Java program. The Dependencies For this POC, I'm going to write a simple Spring Boot CommandLineRunner program. But you can use pretty much any other

Apache Drill vs. Apache Spark – Which SQL query engine is better for you?

by Sunny Srinidhi - September 23, 2019February 13, 20200

If you are in the big data or data science or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a tiny bit of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. But the first thing that I wondered when I glanced over the capabilities of Apache Drill was, how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself and maybe for you guys too. It is very important to understand that

Analyse Kafka messages with SQL queries using Apache Drill

by Sunny Srinidhi - September 23, 2019January 13, 20201

In the previous post, we figured out how to connect MongoDB with Apache Drill and query data with SQL queries. In this post, let's extend that knowledge and see how we can use similar SQL queries to analyse our Kafka messages. Configuring the Kafka storage plugin in Apache Drill is quite simple, very similar to how we configured the MongoDB storage plugin. First, we run our local instances of Apache Drill, Apache Zookeeper, and Apache Kafka. After this, head over to http://localhost:8047/storage, where we can enable the Kafka plugin. You should see it in the list to the right of the page. Click the Enable button. The storage plugin will be enabled. After this, we need to add a few configuration