Data ScienceTech

Apache Kafka Streams and Tables, the stream-table duality

In the previous post, we tried to understand the basics of Apache's Kafka Streams. In this post, we'll build on that knowledge and see how Kafka Streams can be used both as streams and tables. Stream processing has become very common in most modern applications today. You'll have a minimum of one stream coming into your system to be processed. And depending on your application, it'll mostly be stateless. But that's not the case with all applications. We'll have some sort of data enrichment going on in between streams. Suppose you have one stream of user activity coming in. You'll ideally have a user ID attached to each fact in that stream. But down the pipeline, user ID is not going to be enough for processing. Maybe you need more information about the user to be present in t...

Read More
Data ScienceTech

Getting started with Apache Kafka Streams

In the age of big data and data science, stream processing is very significant. So it's not at all surprising that every major organisation has at least one stream processing service. Apache has a few too, but today we're going to look at Apache's Kafka Streams. Kafka is a very popular pub-sub service. And if you've worked with Kafka before, Kafka Streams is going to be very easy to understand. And if you haven't got any idea of Kafka, you don't have to worry, because most of the underlying technology has been abstracted in Kafka Streams so that you don't have to deal with consumers, producers, partitions, offsets, and the such. In this post, we'll look that a few concepts of Kafka Streams, and maybe understand how it differs from other stream processing engines. First of all, Kafka...

Read More
Data ScienceTech

Apache Drill vs. Apache Spark – Which SQL query engine is better for you?

If you are in the big data or data science or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a tiny bit of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. But the first thing that I wondered when I glanced over the capabilities of Apache Drill was, how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself and maybe for you guys too. It is very important to understand that there is a fundamental difference between the two, how they are implemented, and what they are capable of. With Apache Drill, we write SQL quer...

Read More
Data ScienceTech

Analyse Kafka messages with SQL queries using Apache Drill

In the previous post, we figured out how to connect MongoDB with Apache Drill and query data with SQL queries. In this post, let's extend that knowledge and see how we can use similar SQL queries to analyse our Kafka messages. Configuring the Kafka storage plugin in Apache Drill is quite simple, very similar to how we configured the MongoDB storage plugin. First, we run our local instances of Apache Drill, Apache Zookeeper, and Apache Kafka. After this, head over to http://localhost:8047/storage, where we can enable the Kafka plugin. You should see it in the list to the right of the page. Click the Enable button. The storage plugin will be enabled. After this, we need to add a few configuration parameters to start querying data from Kafka. Click the Update button next to Kafka, whi...

Read More
Data ScienceTech

Getting Started with Apache Drill and MongoDB

Not a lot of people have heard of Apache Drill. That is because Drill caters to very specific use cases, it's very niche. But when used, it can make significant differences to the way you interact with data. First, let's see what Apache Drill is, and then how we can connect our MongoDB data source to Drill and easily query data. What is Apache Drill? According to their website, Apache Drill is "Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage." That's pretty much self-explanatory. So, Drill is a tool to query Hadoop, MongoDB, and other NoSQL databases. You can write simple SQL queries that run on the data stored in other databases, and you get the result in a row-column format. The best part is you can even query Apache Kafka and AWS S3 data with this. ...

Read More
Data ScienceTech

Connect Apache Spark to your HBase database (Spark-HBase Connector)

There will be times when you’ll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you’ll query the database, get the data in whatever format you fancy, and then load that into Spark, maybe using the `parallelize()`function. This works, just fine. But depending on the size of the data, this could cause delays. At least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in Hortonworks repository. Now, what is this connector and why should you be considering this? Source: https://spark.apache.org/ The Spark-HBase Connector (shc-core) The SHC is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pickup the data directly fro...

Read More
Tech

How you can improve your backend services’ performance using Apache Kafka

In most real world applications, we have a RESTful API service facing various client applications and a collection of backend services which process the data coming from those clients. Depending on the application, the architecture might have various services spread across multiple clusters of servers, and some form of queue or messaging service gluing them together. Today, we're going to talk about one such messaging service - Apache Kafka - and how it can improve the performance of your services. We're going to assume that we have at least two microservices, one for the APIs that are exposed to the world, and one which processes the requests coming in from the API microservice, but in an async fashion. Because this is async in nature, this might not be suitable for the kind of ap...

Read More
Tech

Simple Apache Kafka Producer and Consumer using Spring Boot

Originally published here: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b Before I even start talking about Apache Kafka here, let me answer your question after you read the topic — aren’t there enough posts and guides about this topic already? Yes, there are plenty of reference documents and how-to posts about how to create Kafka producers and consumers in a Spring Boot application. Then why am I writing another post about this? Well, in the future, I’ll be talking about some advanced stuff, in the data science space. Apache Kafka is one of the most used technologies and tools in this space. It kind of becomes important to know how to work with Apache Kafka in a real-world application. So this is an introductory po...

Read More