In most real-world applications, we have a RESTful API service facing various client applications and a collection of backend services which process the data coming from those clients. Depending on the application, the architecture might have various services spread across multiple clusters of servers, and some form of queue or messaging service gluing them together. Today, we’re going to talk about one such messaging service, Apache Kafka, and how it can improve the performance of your services.
We’re going to assume that we have at least two microservices, one for the APIs that are exposed to the world, and one which processes the requests coming in from the API microservice, but in an async fashion. Because this is async in nature, this might not be suitable for the kind of applications where the clients will be waiting for some data to be returned from the API.
Let’s suppose a situation where the clients send some data via the APIs to your servers. You need to process this data and store it in a database, let’s say, for some sort of analytical calculation. In such a situation, the clients will not be waiting for any kind of response from the server, just an acknowledgement that the server has received the data.
In such cases, we’ll have our API gateway take the data coming from the clients, put it into our Kafka topics, and then send a response to the clients saying that the data has been received. This way, the clients will not be waiting for a long time for the ack. Because we’re not processing the data in the API microservice, we can quickly send back a response to the clients.
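To make the flow concrete, here is a minimal sketch of the API side. The handler name, the topic name `ingest-events`, and the `publish` callable are all assumptions made for illustration; the in-memory list stands in for a real Kafka producer so the example runs on its own.

```python
import json
from typing import Callable, List, Tuple


def handle_ingest(
    payload: dict,
    publish: Callable[[str, bytes], None],
    topic: str = "ingest-events",  # hypothetical topic name
) -> dict:
    """Accept client data, hand it to the queue, and acknowledge right away."""
    # Serialize once here; the data processing service decodes it later.
    publish(topic, json.dumps(payload).encode("utf-8"))
    # No processing happens in the API service, so the ack goes back quickly.
    return {"status": "accepted"}


# In-memory stand-in for a Kafka producer (illustration only).
sent: List[Tuple[str, bytes]] = []
ack = handle_ingest({"user": 1, "event": "click"}, lambda t, v: sent.append((t, v)))
```

With a real client, `publish` would wrap the producer's send call. In the kafka-python library, for example, that would be roughly `producer.send(topic, value=...)`.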
The other microservice we have, let’s call it the data processing microservice, will read messages coming in from Kafka, and process each message, one after another.
Apache Kafka is designed in such a way that we can have multiple instances of our data processing microservice reading from the same topic. The way we achieve this is by having multiple “partitions” in our Kafka topic. Within a consumer group, each partition is assigned to only one consumer at a time, so consumers from the same group can read from the same Kafka topic (from different partitions) without two of them being handed the same message. Kafka also writes each message to exactly one partition, so a message shows up twice only if the producer sends it twice.
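The idea of splitting partitions among the consumers of one group can be sketched as below. This is a simplified round-robin model written for illustration, not Kafka's actual assignment code; the function name and the consumer names are made up.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition to exactly one consumer in the group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        # Deal partitions out in turn so no partition has two owners.
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


two_workers = assign_partitions([0, 1, 2, 3], ["worker-a", "worker-b"])
three_workers = assign_partitions([0, 1], ["worker-a", "worker-b", "worker-c"])
```

In the second call there are more consumers than partitions, so `worker-c` gets nothing to read. The same holds in Kafka: adding instances beyond the partition count does not add throughput.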
As you can see from the block diagram above, we have two instances of the same data processing microservice, which read from two different partitions of the same Kafka topic. The API microservice can specify which partition a message has to go to, if required. If not, the producer in the microservice can just produce the message to the topic, and the producer’s partitioner will pick the partition: messages with a key are routed by a hash of that key, so the same key lands on the same partition as long as the number of partitions does not change, and messages without a key are spread across the partitions.
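Key-based routing can be illustrated with a few lines of code. This is only a sketch: `partition_for` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen because it ships with Python. Kafka's Java client uses a murmur2 hash for keyed messages, so the actual partition numbers would differ.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    # crc32 is a stand-in hash for this sketch, not the one Kafka uses.
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition for a fixed partition count.
first = partition_for(b"customer-42", 2)
second = partition_for(b"customer-42", 2)
```

Because all messages for one key go to one partition, they are also read by one consumer, in the order they were produced.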
During the processing of the messages, both the instances of the data processing microservice can talk to the same instance of a database, or a storage service such as Amazon S3 or Google Drive, or any other service inside or outside the cluster.
Once the processing is complete, the microservice can either commit the offset manually or leave it to the consumer client to commit offsets automatically at a regular interval.
Committing Kafka offset manually vs. automatically
Letting Kafka handle the committing of offsets seems easy and pretty straightforward. But in some cases, you might not want your offsets committed automatically. For example, suppose the connection to your database is down, and you’re not able to process a message received from Kafka. In such cases, if the offset is committed anyway, the consumer moves past that message and you effectively lose it, unless you know its offset and rewind the consumer to start consuming again from that offset.
Instead of going through all that pain, it’s easier to just commit the offset manually after processing each message. This way, you can wait until your database connection is back online, process the message, commit the offset, and then continue with the next message.
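A minimal sketch of that loop follows. It assumes messages arrive as `(offset, value)` pairs and that `process` raises `ConnectionError` while the database is unreachable; `consume_with_manual_commit`, `flaky_store`, and the in-memory lists are hypothetical stand-ins for a real consumer and database.

```python
import time
from typing import Callable, Iterable, List, Tuple


def consume_with_manual_commit(
    messages: Iterable[Tuple[int, str]],
    process: Callable[[str], None],
    commit: Callable[[int], None],
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Process messages one by one, committing only after each one succeeds."""
    for offset, value in messages:
        while True:
            try:
                process(value)
                break
            except ConnectionError:
                # Database is down: wait and retry the same message.
                sleep(retry_delay)
        # Kafka's committed offset is the next offset to read, hence + 1.
        commit(offset + 1)


# Stand-in database that fails on the first attempt only.
attempts = {"count": 0}
stored: List[str] = []


def flaky_store(value: str) -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("database unavailable")
    stored.append(value)


committed: List[int] = []
consume_with_manual_commit(
    [(0, "a"), (1, "b")], flaky_store, committed.append, sleep=lambda s: None
)
```

With a real client you would also turn automatic commits off. In kafka-python, for instance, that means creating the consumer with `enable_auto_commit=False` and calling `consumer.commit()` after each processed message. A production version would likely cap the number of retries, or park the message somewhere, instead of looping forever.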
We have found this approach to be a lot easier to handle in our applications. Then again, this completely depends on the kind of project you’re working on, and how important each message is for you.
That’s pretty much how you can decouple your APIs from your business logic and improve performance using Apache Kafka. This is the most basic first step in introducing Kafka to your architecture. If you think I’ve missed something or gone wrong somewhere, please let me know in the comments.