Redundancy in a distributed system
A lot of engineers, system designers, and architects overlook redundancy, at least in my experience. Sometimes people ignore it because the system or the product is still in its early stages and there’s not a lot happening. Others ignore it for no particular reason. But we have all suffered at some point from not having duplicates of the critical parts of our systems. In this post, I’ll try to explain what I’ve seen in the distributed computing industry over the last few years and why redundancy (replication or duplication) is important right from day one. Hopefully this won’t be a boring one. Let’s get started.
Types of Redundancies
In software engineering, redundancy is broadly classified into three types:
- Information redundancy
- Time redundancy
- Physical redundancy
Information redundancy is very easy to understand. Whenever data moves from one system to another, there is no guarantee that all of it will arrive intact. Packets could be dropped in the network, or the data could be altered during transmission. To detect this, we add extra check bits to the data, for example a parity bit or a checksum. At the receiving end, we use these bits to validate the correctness of the data received. This is information redundancy.
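The check-bit idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `add_parity_bit` and `is_intact` are made up for this post, and a single even-parity bit is the weakest possible check. Real protocols use stronger codes such as CRCs.

```python
def count_ones(payload: bytes) -> int:
    """Count the 1 bits across every byte of the payload."""
    return sum(bin(byte).count("1") for byte in payload)


def add_parity_bit(payload: bytes) -> bytes:
    """Hypothetical sender side: append one byte that stores an even-parity bit."""
    return payload + bytes([count_ones(payload) % 2])


def is_intact(frame: bytes) -> bool:
    """Hypothetical receiver side: recompute the parity and compare with the stored bit."""
    payload, parity = frame[:-1], frame[-1]
    return count_ones(payload) % 2 == parity


frame = add_parity_bit(b"hello")

# Simulate a single bit flipped somewhere in transit.
corrupted = bytes([frame[0] ^ 0b00000001]) + frame[1:]

print(is_intact(frame))      # the untouched frame passes the check
print(is_intact(corrupted))  # the single-bit error is detected
```

A single parity bit only catches an odd number of flipped bits. Two flips cancel each other out, which is one reason production systems prefer longer checksums.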
Time redundancy is also pretty easy to understand. You perform an operation in your system, and if there is a need, you perform it again. Database transactions are an example of time redundancy: if a transaction aborts, it can be run again.
Now, physical redundancy, which is what we’re going to talk about in this post, is the most common form of redundancy in a production environment. Physical redundancy is where you add extra hardware to your setup to scale things up or to be “highly available.”
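As a rough sketch of time redundancy, the snippet below simply repeats an operation when it fails. The helper `run_with_retries` and the `flaky_write` operation are hypothetical names invented for this example, and the failure is simulated rather than coming from a real database.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(operation: Callable[[], T], attempts: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Run the operation, performing it again after a failure, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower error types
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Simulated operation (an assumption for this example): fails twice, then succeeds.
calls = {"count": 0}


def flaky_write() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "committed"


result = run_with_retries(flaky_write, attempts=5)
print(result, "after", calls["count"], "attempts")
```

Repeating an operation is only safe when doing it twice has the same effect as doing it once, so this pattern fits idempotent operations or operations that roll back cleanly on failure.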
Physical Redundancy
We use physical redundancy mostly for scaling up our services. At least, this is the most common use case. And we start thinking of this only when we’re not able to process all the requests we’re getting with just one box. So we’ll start spinning up more boxes. Because this only happens when the traffic increases, it is safe to assume that it’ll not happen on the first day.
Another use case for physical redundancy is in the database layer, where you want to back up your data on another box in case your database box goes down. This is the second most common use case I’ve seen. But even this doesn’t happen from day one. Most people start replicating the data layer only after their first system failure. Fortunately, in most cases, this happens very early in the life of a product.
But I’m here to tell you that, if nothing else, we should at least have redundancy in the data layer and the application layer from day one. I’ll try to explain this with an example in which our system or product is made up of microservices. Let’s consider Instagram. In practice, Instagram probably has an auth service and an image service (I know that “image service” sounds very vague, but we’ll just go with that for the example).
The system sends a request to the auth service with the user’s details whenever a user tries to log in to the app. Based on the response of the auth service, the system decides whether the user should be redirected to the image service or not. In this example, the auth service is the entry point to your product. If the auth service goes down, none of your users will be able to use your product. So the auth service becomes your single point of failure.
Now let’s assume you’re running only one instance of the auth service. And for some reason, that box goes down. So you don’t have an auth service anymore. This means 100% of your traffic can no longer reach your product. You’re experiencing a failure that you didn’t anticipate. And if you didn’t anticipate this failure, something was wrong with the system design in the first place. But what will you do now?
You’ll keep losing business until you fix the auth service box and bring it up again. But you don’t want this to happen again, so what do you do? A simple solution is to introduce redundancy. Set up a load balancer and have at least two instances of the auth service running at all times on two separate boxes. This way, even if one box goes down, there will be one more box to handle things for a while. Ideally, there should be redundancy at all levels, including the load balancer itself.
The same thing applies to the data layer as well. What if your database goes down because of a disk error on the server you’re using, and the data can’t be restored? You’re doomed. All your data is gone. This happens very rarely, but it does happen. The solution? Replicate your database.
There are many ways in which you can do this. Most companies that sell you a database will offer you a database backup service as well. You can replicate in real time and even have the backup server step in as the primary node for the database if something goes wrong. I recommend you do this from day one. That way, even the first piece of data that your first ever user generated is available anytime you want.
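To make the load balancer setup a bit more concrete, here is a toy round-robin balancer in Python. Everything in it is hypothetical: the `Instance` and `RoundRobinBalancer` classes, the `auth-a` and `auth-b` names, and the `healthy` flag that stands in for a real health check. It is a sketch of the idea, not how any particular load balancer works internally.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One box running the auth service (hypothetical stand-in)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        return f"{self.name} handled {request}"


class RoundRobinBalancer:
    """Rotate over the instances and skip the ones that are down."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def route(self, request: str) -> str:
        # Try each instance at most once, starting from the next one in rotation.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if instance.healthy:
                return instance.handle(request)
        raise RuntimeError("no healthy instance available")


auth_a = Instance("auth-a")
auth_b = Instance("auth-b")
balancer = RoundRobinBalancer([auth_a, auth_b])

first = balancer.route("login-1")   # goes to auth-a
auth_a.healthy = False              # simulate the first box going down
second = balancer.route("login-2")  # goes to auth-b
third = balancer.route("login-3")   # auth-a is skipped, auth-b answers again
print(first, second, third, sep="\n")
```

With a single instance, marking it unhealthy would make every `route` call fail. With two, logins keep working while the broken box is being repaired.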
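The idea of a backup server stepping in as the primary can be sketched with two in-memory nodes. This is a deliberately simplified model with made-up names (`Node`, `ReplicatedStore`, `db-a`, `db-b`): each write is copied to the replica before it returns, and the replica is promoted when the primary is found to be down. Real databases replicate in more sophisticated ways, so treat this only as an illustration of the principle.

```python
class Node:
    """A database box holding its own copy of the data (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Write to the primary, copy to the replica, and fail over when needed."""

    def __init__(self, primary: Node, replica: Node):
        self.primary = primary
        self.replica = replica

    def _ensure_primary(self) -> None:
        if self.primary.alive:
            return
        if not self.replica.alive:
            raise RuntimeError("both nodes are down")
        # Promote the replica: it becomes the new primary.
        self.primary, self.replica = self.replica, self.primary

    def write(self, key: str, value: str) -> None:
        self._ensure_primary()
        self.primary.data[key] = value
        if self.replica.alive:
            self.replica.data[key] = value  # synchronous copy to the second box

    def read(self, key: str) -> str:
        self._ensure_primary()
        return self.primary.data[key]


db_a, db_b = Node("db-a"), Node("db-b")
store = ReplicatedStore(primary=db_a, replica=db_b)

store.write("user:1", "first post")
db_a.alive = False                  # simulate the primary box failing
recovered = store.read("user:1")    # served by the promoted replica
print(recovered, "from", store.primary.name)
```

Because the very first write was already copied to the second node, nothing is lost when the first node fails, which is the argument for replicating from day one.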
One more thing…
There’s one more thing I’ve seen people do in production environments. People and companies often buy very big boxes and run multiple services on that one box. This is a recipe for disaster. If that one box goes down, multiple services go down with it. The solution is to buy small boxes and run a maximum of two services on a box. Scale everything horizontally instead of vertically. You should scale vertically only when there’s just one service running on a box and that one service needs more resources. You should never scale vertically to run multiple services on one box.
This way, it becomes very easy to set up auto scaling rules when they become necessary. If you’re running multiple services on a single box, auto scaling becomes very difficult. Also, having small boxes makes it easy to maintain the distributed nature of the application. But of course, there are use cases and scenarios where you need big boxes. An example of this is any data-heavy operation. An Apache Spark job, for instance, needs a box with enough memory.
But there’s a clear difference here. You need such big boxes for large-scale computations, not for regular API services. Anything that is consumer-facing should be small enough to fit in a small box. Batch processing, of course, is going to need some beefy boxes.
So, the next time you’re designing a distributed system, keep in mind that any service exposed to the outside world should be small enough to fit in a small box. There should be redundancy at all layers, especially the application and data layers. And there should be auto scaling set up.
I hope this made sense. And yes, of course, I’m not always right, and this is not going to sound practical for all applications. But this comes from my experience, so there are cases where it holds true. If there’s anything you want to add or if you want to correct me, please feel free to leave a comment below or connect with me.
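An auto scaling rule for a small, single-service box can be as simple as the sketch below. The function name `desired_instances`, the 60% CPU target, and the minimum and maximum counts are all assumptions picked for illustration. The proportional formula resembles what common autoscalers use, but this is not the implementation of any specific product.

```python
import math


def desired_instances(current: int, cpu_percent: float,
                      target_percent: float = 60.0,
                      minimum: int = 2, maximum: int = 10) -> int:
    """Scale the instance count in proportion to load, within fixed bounds.

    The minimum of 2 is an assumption that keeps one redundant box
    running even when traffic is low.
    """
    wanted = math.ceil(current * cpu_percent / target_percent)
    return max(minimum, min(maximum, wanted))


print(desired_instances(current=2, cpu_percent=90))   # busy: scale out
print(desired_instances(current=4, cpu_percent=15))   # quiet: scale in, but keep the minimum
print(desired_instances(current=8, cpu_percent=120))  # overloaded: capped at the maximum
```

A rule like this is easy to reason about when one box runs one service, because the CPU reading reflects that service alone. With several services sharing a box, the same number no longer tells you which service needs more capacity.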
And if you like what you see here, or on my Medium blog, and would like to see more helpful technical posts like this in the future, consider supporting me on Patreon and GitHub.