Getting started with Apache Kafka Streams
by Sunny Srinidhi - September 30, 2019

In the age of big data and data science, stream processing is very significant. So it's not at all surprising that every major organisation has at least one stream processing service. Apache has a few too, but today we're going to look at Apache's Kafka Streams.

Kafka is a very popular pub-sub service. And if you've worked with Kafka before, Kafka Streams is going to be very easy to understand. And if you haven't got any idea of Kafka, you don't have to worry, because most of the underlying technology has been abstracted in Kafka Streams so that you don't have to deal with consumers, producers, partitions, offsets, and the like. In this post, we'll look at a few concepts of Kafka Streams and understand how it differs from other stream processing engines.

First of all, Kafka Streams is built on top of Apache Kafka. What this means is that you don't need extra hardware or infrastructure to move your processing to Kafka Streams. But why would you move to Kafka Streams if you're already using Kafka for the same purpose? Well, because at its heart, Kafka isn't really a stream processing engine: consumers poll records in batches, so processing built directly on the consumer API effectively works in micro-batches. With Kafka Streams, you process each incoming message on a per-message basis, so the latency is actually in milliseconds.

Also, an incoming piece of data in Kafka Streams isn't called a message, as is the case in traditional Kafka. We call it a record or a fact. And you don't work with topics in the traditional sense either; instead, you work with abstractions such as KStreams and KTables. It gets really interesting.

At this point, let's see why Kafka Streams would be interesting for your application:

- Elastic, highly scalable, fault-tolerant
- Deploys to containers, VMs, bare metal, and the cloud
- Equally viable for small, medium, and large use cases
- Fully integrated with Kafka security
- Lets you write standard Java and Scala applications
- Exactly-once processing semantics
- No separate processing cluster required
- Develop on Mac, Linux, or Windows

Kafka Streams is used to process an unbounded flow of facts or records. That means there is a continuous flow of data into the stream; it virtually never stops. And each record or fact is a collection of key-value pairs. These key-value pairs are typed, even though internally Kafka messages are just arrays of bytes. What's also interesting is that Kafka Streams provides both stateless and stateful processing, along with various windowing operations. You can also do joins and aggregations.

The Kafka Streams APIs work within your application. What this means is that you don't run Kafka Streams as a separate cluster or piece of infrastructure. You use the Kafka Streams APIs within your application to process streams of data. If you're writing a microservice, you use the Kafka Streams APIs within that service, and scaling the microservice in turn scales the stream processing as well. So if your microservice is distributed across, say, ten nodes, there will be ten instances of the Kafka Streams application as well. And these instances will connect to the Kafka brokers and receive the stream data effortlessly. This even works inside VMs or containers.

There are a lot of other interesting concepts in Kafka Streams. We'll look at most of them with code examples in later posts, but the two quick sketches below should give you a feel for the API.
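To make the "library, not cluster" point concrete, here's a minimal sketch of a Kafka Streams application that reads from one topic, transforms each record, and writes to another. The topic names (input-topic, output-topic), the application id, and the broker address are placeholders for illustration; this assumes the kafka-streams library (2.x or later) is on your classpath.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        // Standard application config; the broker address and topic
        // names here are made-up placeholders.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build the topology: each record is processed individually,
        // as it arrives, with no micro-batching.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase())
              .to("output-topic");

        // The stream runs inside this process; there is no separate
        // processing cluster to deploy.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

If you start a second copy of this same application with the same application id, the two instances automatically split the input topic's partitions between them, which is exactly the scaling story described above.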
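And since we mentioned KTables, stateful processing, and aggregations, here's a sketch of the classic word-count topology, which turns a KStream of text lines into a KTable of running counts. Again, the topic names (text-input, word-counts) are assumptions, and default String serdes (as configured above) are taken as given.

```java
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // A KStream is an unbounded sequence of records (facts).
        KStream<String, String> textLines = builder.stream("text-input");

        // Split each line into words, re-key by word, and keep a running
        // count per word. count() is stateful: Kafka Streams backs it with
        // a local state store and a changelog topic for fault tolerance.
        KTable<String, Long> wordCounts = textLines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();

        // A KTable is a changelog view: every update to a word's count is
        // emitted downstream as a new record.
        wordCounts.toStream()
                  .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```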