
How to build a simple data lake using Amazon Kinesis Data Firehose and Amazon S3

As the data generated by IoT devices, mobile devices, applications, and other sources grows by the hour, building a data lake to store all of it is becoming crucial for almost any application at scale. There are many tools and services you could use to create a data lake, but it's easy to overlook the simplest and easiest of them all: the AWS stack. In this post, we'll see how to create a very simple yet highly scalable data lake using Amazon Kinesis Data Firehose and Amazon S3.


Amazon Kinesis Data Firehose

Kinesis Data Firehose is a service that Amazon offers as a part of AWS, built to ingest large-scale streaming data from various sources and deliver it into a data lake. Not just that, Firehose can even transform the streaming data before it reaches the data lake. The best part is that the transformation is completely serverless, with no complex pipeline setup. You only need to create a Lambda function that takes the incoming raw data as input and returns the transformed data as output. That output is what ends up in the data lake.

Firehose is designed to scale right out of the box, so there's nothing we need to do as users to make it scalable. And because it's completely serverless, we don't have to maintain EC2 instances, and you only pay for what you use.


Amazon S3

Amazon's S3, or Simple Storage Service, is nothing new. It has been around for ages. S3 is a great fit when you want to store a large number of files online and need the storage to scale with your platform, which also makes it a great tool to use as a data lake. It has built-in permission management not just at the bucket level, but at the object level. So you don't have to worry about your data accidentally becoming public.

Anyway, let’s now see how we can use these two together to create a data lake.


Creating a data lake with Firehose and S3

Now that we know, at a basic level, what Firehose and S3 are, let's see how we can use them together to create a simple data lake. First, we need to make changes to the services from which we collect data: we integrate the AWS SDK into those services and have them send their data to Firehose.

For example, suppose you want to dump all the logs generated by your services into the data lake. You can write a FirehoseLogWriter that takes the data to be logged and, instead of writing it to a file or standard out, sends it to a Firehose delivery stream. This way, all your logs will be streamed directly to Firehose.
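Such a log writer could be sketched as below. This is a minimal illustration, not a complete implementation: the class takes a Firehose client (with boto3, that would be `boto3.client("firehose")`) and a delivery stream name, both of which are assumptions here, and sends each log entry as a newline-delimited JSON record via `put_record`.

```python
import json
import time


class FirehoseLogWriter:
    """Sends log entries to a Kinesis Data Firehose delivery stream
    instead of writing them to a file or standard out."""

    def __init__(self, client, stream_name):
        # client is expected to behave like a boto3 Firehose client,
        # e.g. boto3.client("firehose"); stream_name is your delivery stream.
        self.client = client
        self.stream_name = stream_name

    def log(self, level, message):
        # Newline-delimited JSON keeps records separable once Firehose
        # concatenates them into files in S3.
        record = json.dumps({
            "timestamp": time.time(),
            "level": level,
            "message": message,
        }) + "\n"
        self.client.put_record(
            DeliveryStreamName=self.stream_name,
            Record={"Data": record.encode("utf-8")},
        )
```

Dropping this in behind your existing logging interface means the rest of your code never needs to know the logs are going to Firehose.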

Next, you can configure that Firehose stream to write all incoming data to an S3 destination. Firehose will then start creating log files in the specified S3 location automatically, and will even rotate these files automatically based on buffer size and time.
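You can set this up in the AWS console, or programmatically. Below is a rough sketch of what the S3 destination configuration looks like; the bucket ARN, IAM role, and buffer limits are placeholder assumptions you would replace with your own values.

```python
# Sketch of an S3 destination configuration for a Firehose delivery stream.
# The role ARN and bucket ARN below are placeholders, not real resources.
s3_destination = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",  # assumed role
    "BucketARN": "arn:aws:s3:::my-data-lake-bucket",               # assumed bucket
    "Prefix": "logs/",  # Firehose appends a date-based path under this prefix
    "BufferingHints": {
        # Firehose rotates (delivers) a file when either limit is reached,
        # which is the automatic file rotation mentioned above.
        "SizeInMBs": 64,
        "IntervalInSeconds": 300,
    },
    "CompressionFormat": "GZIP",
}

# With boto3, this configuration would be passed to create_delivery_stream:
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="app-logs",
#     DeliveryStreamType="DirectPut",
#     ExtendedS3DestinationConfiguration=s3_destination,
# )
```

GZIP compression is worth enabling: log data compresses well, and it cuts both your S3 storage cost and the amount of data later query tools have to scan.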

Once the data is in S3, you can use a plethora of other tools to query that data and build dashboards with visualizations. But what about the transformation I mentioned earlier? How do you do that?


Transforming incoming data in Firehose Stream

It's not very hard. I already mentioned that we can use Lambda functions for this. But to begin with, Kinesis Data Firehose can convert the data format right out of the box: it can convert incoming JSON records to columnar formats like Apache Parquet or ORC, just by selecting the option during the setup of the stream.

But more often than not, this is not enough. You'll need to add or remove fields from the data, or translate the data into a form other services can understand. In such cases, you can write a Lambda function in any of the supported languages and use that Lambda function as your transformation layer. This is also pretty simple.

When you're creating or editing the stream, AWS gives you an option to attach a transformation Lambda, so you need to have the Lambda function ready before you set this up. After you get the Lambda function ready, just select it as the transformation layer in the stream. Whenever the stream receives a piece of data, it will invoke the Lambda function with that data. The Lambda function, after manipulating the data, returns the new data, and the stream continues with the next steps. Note that Firehose attaches a single transformation Lambda per stream, but that one function can chain as many transformation steps internally as you like.
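A transformation Lambda follows a fixed contract: Firehose hands it a batch of base64-encoded records, and it must return each record's `recordId` along with a `result` status and the (re-encoded) transformed data. The sketch below assumes the incoming records are JSON and shows one illustrative transformation, adding a field; the field name is just an example.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation Lambda.

    Firehose delivers each record base64-encoded and expects back the
    original recordId, a result status ("Ok", "Dropped", or
    "ProcessingFailed"), and the transformed data, base64-encoded again.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: stamp each record with an extra field.
        payload["processed"] = True

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

Returning `"Dropped"` instead of `"Ok"` for a record filters it out of the stream entirely, which is handy for discarding noise before it ever lands in S3.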

If you think this is cool enough, try it out once. You'll be surprised at how simply this works.


And if you like what you see here, or on my Medium blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and Github.

