In the age of big data and data science, stream processing is very significant. So it’s not at all surprising that every major organisation has at least one stream processing service. Apache has a few too, but today we’re going to look at Apache’s Kafka Streams.
Kafka is a very popular pub-sub service. And if you’ve worked with Kafka before, Kafka Streams is going to be very easy to understand. And if you haven’t got any idea of Kafka, you don’t have to worry, because most of the underlying technology has been abstracted in Kafka Streams so that you don’t have to deal with consumers, producers, partitions, offsets, and the such. In this post, we’ll look that a few concepts of Kafka Streams, and maybe understand how it differs from other stream processing engines.
First of all, Kafka Streams is build on top of Apache Kafka. What this means is that you don’t need extra hardware or infrastructure to move your processing to Kafka Streams. But why would you move to Kafka Streams if you’re already using Kafka for the same purpose? Well, because at it’s heart, Kafka isn’t really a stream processing engine. It uses micro batching to achieve it. But with Kafka Streams, you process each incoming message on a per-message basis. So the latency is actually in milliseconds.
Also, an incoming piece of data in Kafka Streams isn’t called a message, as is the case in traditional Kafka. We call it a record or a fact. And there are no topics of the traditional sense in Kafka Streams. We have stuff like KStreams and KTables. It gets really interesting.
At this point, let’s see why Kafka Steams would be interesting for your application:
- Elastic, highly scalable, fault-tolerant
- Deploy to containers, VMs, bare metal, cloud
- Equally viable for small, medium, & large use cases
- Fully integrated with Kafka security
- Write standard Java and Scala applications
- Exactly-once processing semantics
- No separate processing cluster required
- Develop on Mac, Linux, Windows
Kafka Streams is used to process an unbounded flow of facts or records. That means there is a continuous flow of data into the stream, it virtually never stops. And each record or fact is a collection of key-value pairs. Theses key-value pairs are typed. Internally though, Kafka messages are just an array of bytes.
What’s also interesting is that Kafka Streams provides both stateless and stateful processing, along with various windowing operations. You also can do joins and aggregations.
The Kafka Streams APIs work within your application. What this means is that you don’t run Kafka Streams as a separate cluster or infrastructure. You use the Kafka Streams APIs within your application to process streams of data. If you’re writing a microservice, you use the Kafka Streams APIs within that service, and you scale the microservice to in-turn scale the Streams API as well. So if your microservices are distributed across say ten nodes, there will be ten instances of Kafka Steams APIs as well. And these instances will connect to Kafka brokers and receive the stream data effortlessly. This even works inside VMs or containers.
There are a lot of other interesting concepts in Kafka Streams. We’ll look at most of them with code examples in later posts.