Choreography-based Saga for Microservices and Serverless Applications
If you have been working in the software development industry for the last few years, you have heard about both microservices and serverless applications. Especially serverless, as most companies are riding this wave, and for good reason. But when you’re architecting a whole system as a bunch of microservices, serverless or not, how do you make sure that all transactions are taken care of properly?
When your application is broken down into microservices and is distributed, you can’t wrap a feature in a single ACID transaction, because each microservice has its own database and a feature or function usually involves several such microservices. In this post, we’ll take the example of one of the most popular online shopping services, Amazon, and see how we can come up with one possible solution.
But before we begin, I want to make it clear that I’m not talking about the actual architecture used by Amazon. I’m just taking Amazon as an example. You can consider any e-commerce service, for that matter. Also, I’m mostly talking about microservices in this post. But when I say microservice, please feel free to replace that with serverless, because most of the things I say about microservices in this context apply equally to serverless applications. So, with that out of the way, let’s get started.
The Problem
When you have designed your application as a bunch of microservices, there are a couple of assumptions I’m making:
- You have distributed services
- Each microservice has its own database
With these assumptions, let’s consider this scenario. You’re trying to place an order. For this, you first select the product you want to order and proceed to checkout. Here, you enter your card details and wait for the payment to go through. During this time, if you open up Amazon on another device or browser and go to your orders page, you can see that the order is already placed, but its status is still pending. When you proceed to payment on the checkout page, the Amazon app has to actually talk to another service, probably called the payment service.
Only after the payment is made and the payment database is updated can the order be confirmed and placed. Otherwise, the order has to be either cancelled or put on hold. Therefore, this becomes a transaction. But in a distributed system, how do you make sure the transaction is actually happening as expected? This is the problem we’re trying to solve.
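To make the problem concrete, here is a minimal sketch of what goes wrong without any coordination. It assumes two in-memory SQLite databases standing in for the order and payment databases. The table names, columns, and the negative amount used to force a failure are all made up for illustration. The order service naively confirms the order first, and the failed payment can only roll back its own database.

```python
import sqlite3

# Two independent databases: one per (hypothetical) service.
orders_db = sqlite3.connect(":memory:")
payments_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
payments_db.execute(
    "CREATE TABLE payments (order_id INTEGER, amount INTEGER CHECK (amount > 0))"
)

# Local transaction in the order database: commits on its own.
with orders_db:
    orders_db.execute("INSERT INTO orders (id, status) VALUES (1, 'confirmed')")

# Local transaction in the payment database: fails and rolls back,
# but it has no way to roll back the already committed order.
try:
    with payments_db:
        payments_db.execute(
            "INSERT INTO payments (order_id, amount) VALUES (1, -5)"
        )
except sqlite3.IntegrityError:
    pass  # only the payments database is rolled back

order_status = orders_db.execute(
    "SELECT status FROM orders WHERE id = 1"
).fetchone()[0]
payment_count = payments_db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(order_status, payment_count)  # prints: confirmed 0
```

The order ends up confirmed even though no payment exists, which is exactly the inconsistency a saga is meant to prevent.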
The solution: Choreography-based saga
Now, just to be clear, this is not the only solution. There are a hundred different ways of designing a solution for this. A choreography-based saga is just one of them. But wait, what the heck is a saga? A saga is nothing but a sequence of local transactions, each one confined to a single service and its database. And what exactly is a choreography-based saga? That’s what we’re going to discuss here.
Before we begin, let’s see the flow of events when a user tries to place an order:
- The user places an order.
- The system creates the order, but saves the status as “pending.”
- The system redirects the user to the payment page.
- The user makes the required payment.
- If the payment was successful, the order status is updated to something along the lines of “confirmed.” Otherwise, the order could be cancelled.
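The status changes in this flow can be sketched as a tiny state machine. This is only an illustration: the `OrderStatus` names, the `ALLOWED` table, and the `next_status` helper are hypothetical, and treating “confirmed” and “cancelled” as final states is my assumption, not something a real e-commerce system is required to do.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


# Assumed transitions: a pending order is either confirmed or cancelled,
# and both of those are treated as final states.
ALLOWED = {
    OrderStatus.PENDING: {OrderStatus.CONFIRMED, OrderStatus.CANCELLED},
    OrderStatus.CONFIRMED: set(),
    OrderStatus.CANCELLED: set(),
}


def next_status(current: OrderStatus, payment_succeeded: bool) -> OrderStatus:
    """Return the status an order moves to once the payment result is known."""
    target = OrderStatus.CONFIRMED if payment_succeeded else OrderStatus.CANCELLED
    if target not in ALLOWED[current]:
        raise ValueError(
            f"cannot move order from {current.value} to {target.value}"
        )
    return target


print(next_status(OrderStatus.PENDING, True).value)   # prints: confirmed
print(next_status(OrderStatus.PENDING, False).value)  # prints: cancelled
```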
As you can imagine, in a distributed system, the order service and the payment service are two different microservices. And most probably, these two services are running on different machines. So how do we maintain a transaction in such a situation? This is where the choreography-based saga comes into the picture.
In a choreography-based saga, the first service performs its part of the saga, which is a local transaction. So in our case, the order service performs its local transaction, which is just creating the order. Once this transaction is complete, the service uses either a message bus or an event bus to trigger the next transaction. In our case, the order service sends a message to the payment service on a message bus.
As soon as the payment service receives the message, it starts performing its own local transaction. In this case, that’s processing the payment. As soon as this local transaction is complete, the payment service uses another message or event bus to trigger the next step of the saga. For this example, we don’t have another service in between, so the payment service just sends a message back to the order service about the status of the payment.
Once the order service hears back from the payment service, it performs another local transaction based on the status of the payment: updating the status of the order. There are no more messages to be sent, as this completes the process.
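The round trip described above can be sketched with a toy in-memory message bus. Everything here is an assumption made for illustration: the `MessageBus` class, the topic names `order.created` and `payment.result`, and the `card_limit` rule that decides whether a payment is approved. A real system would use an actual broker and deliver messages asynchronously.

```python
from collections import defaultdict


class MessageBus:
    """Toy synchronous bus: publish() calls every subscriber of a topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in list(self._subscribers[topic]):
            handler(message)


class OrderService:
    def __init__(self, bus):
        self.bus = bus
        self.orders = {}  # stands in for the order database
        bus.subscribe("payment.result", self.on_payment_result)

    def place_order(self, order_id, amount):
        self.orders[order_id] = "pending"  # first local transaction
        self.bus.publish("order.created", {"order_id": order_id, "amount": amount})

    def on_payment_result(self, message):
        # Final local transaction: update the order from the payment status.
        status = "confirmed" if message["approved"] else "cancelled"
        self.orders[message["order_id"]] = status


class PaymentService:
    def __init__(self, bus, card_limit):
        self.bus = bus
        self.card_limit = card_limit  # made-up approval rule
        self.payments = {}  # stands in for the payment database
        bus.subscribe("order.created", self.on_order_created)

    def on_order_created(self, message):
        approved = message["amount"] <= self.card_limit
        self.payments[message["order_id"]] = approved  # second local transaction
        self.bus.publish(
            "payment.result",
            {"order_id": message["order_id"], "approved": approved},
        )


bus = MessageBus()
orders = OrderService(bus)
payments = PaymentService(bus, card_limit=100)
orders.place_order("A1", 40)
orders.place_order("A2", 500)
print(orders.orders)  # prints: {'A1': 'confirmed', 'A2': 'cancelled'}
```

Notice that neither service calls the other directly. Each one only reacts to messages and publishes its own, which is the essence of choreography.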
So we performed multiple small, local transactions, which together form the saga, to carry out one big distributed transaction. But the obvious question here is, what happens if something fails in between? The service whose local transaction fails is responsible for propagating the failure to all the services that came before it.
So whenever there is a failure, there will be another flow of events in the opposite direction, all the way back to the first service, which initiated the transaction. When such an event occurs, every service that receives the failure message undoes the local transaction it performed in this context, typically by running a compensating transaction such as refunding a payment or cancelling an order.
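Here is a rough sketch of that reverse flow. To have more than one step to undo, it adds a hypothetical third service, shipping, which always fails; the example in this post only has the order and payment services. The topic names and the plain functions standing in for services are also assumptions for illustration.

```python
from collections import defaultdict

subscribers = defaultdict(list)  # toy in-memory bus: topic -> handlers
log = []                         # records the order in which steps run
orders, payments = {}, {}        # stand-ins for each service's database


def subscribe(topic, handler):
    subscribers[topic].append(handler)


def publish(topic, order_id):
    for handler in list(subscribers[topic]):
        handler(order_id)


# Order service
def create_order(order_id):
    orders[order_id] = "pending"
    log.append("order created")
    publish("order.created", order_id)


def cancel_order(order_id):  # compensates create_order
    orders[order_id] = "cancelled"
    log.append("order cancelled")


# Payment service
def charge(order_id):
    payments[order_id] = "charged"
    log.append("payment charged")
    publish("payment.charged", order_id)


def refund(order_id):  # compensates charge
    payments[order_id] = "refunded"
    log.append("payment refunded")
    publish("payment.refunded", order_id)  # keep propagating backwards


# Shipping service (hypothetical): its local transaction always fails here.
def ship(order_id):
    log.append("shipping failed")
    publish("shipping.failed", order_id)


subscribe("order.created", charge)
subscribe("payment.charged", ship)
subscribe("shipping.failed", refund)
subscribe("payment.refunded", cancel_order)

create_order("A1")
print(log)
# prints: ['order created', 'payment charged', 'shipping failed',
#          'payment refunded', 'order cancelled']
```

The failure travels back one service at a time, and each service only knows how to undo its own work.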
This way, we can ensure that a distributed transaction either completes or is undone, even when we’re working with microservices or serverless applications. The image below visualizes this process:
The following explains this flow of events:
- The user places an order.
- The order service creates the order, but saves the status as “pending.”
- The order service sends a message to a message bus connected to the payment service.
- The web app redirects the user to the payment page.
- The payment service processes the payment.
- The bank either approves or declines the payment.
- The payment service sends this status as a message to another message bus.
- The order service is listening to this message bus for the payment updates.
- The order status is updated to something along the lines of “confirmed” on successful payment. Otherwise, the order will be cancelled.
The last step here depends on the status of the previous transaction. The order should not be confirmed until the payment is made. This is similar to how a traditional transaction works in a relational database, where later steps take effect only if the earlier ones succeed.
Because each service performs its own local transaction and then triggers the next service directly, with no central coordinator, we call this a choreography-based saga.