Explore your Amazon S3 data online using Filestash

Amazon’s S3, or Simple Storage Service, has become one of the most used cloud services today. We use it for all kinds of purposes, including data lakes, intermediary storage, and persistence layers for databases. I know people who use S3 as their personal online storage, as an alternative to services such as Google Drive and Dropbox.

How to build a simple data lake using Amazon Kinesis Data Firehose and Amazon S3

As the data generated by IoT devices, mobile devices, applications, and so on grows by the hour, creating a data lake to store all of it is becoming crucial for almost any application at scale. There are many tools and services that you could use to create a data lake.

Query data from S3 files using Amazon Athena

Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3, very similar to other SQL query engines such as Apache Drill. But unlike Apache Drill, Athena is limited to data in Amazon's own S3 storage service. However, Athena can query a variety of file formats, including CSV, Parquet, and JSON. In this post, we'll see how to set up a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need
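
To give a rough idea of what a table over a CSV file in S3 might look like, here is a minimal sketch that builds an Athena-style CREATE EXTERNAL TABLE statement in Python. The database, table, column names, and bucket path are all hypothetical placeholders, and the statement assumes a plain comma-delimited file with a single header row.

```python
def build_csv_table_ddl(database, table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for a comma-delimited file set.

    columns is a list of (name, type) pairs; s3_location should point at the
    S3 folder holding the files, not at a single file.
    """
    column_lines = ",\n".join(
        f"  `{name}` {col_type}" for name, col_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        # Assumption: the CSV has exactly one header row to skip.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names, for illustration only.
ddl = build_csv_table_ddl(
    database="sampledb",
    table="sales",
    columns=[("id", "int"), ("product", "string"), ("amount", "double")],
    s3_location="s3://my-example-bucket/sales/",
)
print(ddl)
```

The generated statement can be pasted into the Athena query editor; the column types would need to match the actual data in the file.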

Cleaning and Normalizing Data Using AWS Glue DataBrew

A major part of any data pipeline is the cleaning of data. Depending on the project, cleaning data could mean a lot of things. But in most cases, it means normalizing data and bringing data into a format that is accepted within the project.

Getting Started With Apache Airflow

Apache Airflow is another awesome tool that I discovered just recently. Just a couple of months after discovering it, I can’t imagine not using it now. It’s reliable, configurable, and dynamic. Because it’s all driven by code, you can version control it too.

Kinesis Data Streams vs. Kinesis Firehose Delivery Streams

I have talked about Kinesis before, and I'm sure you've been using Kinesis for longer than me. But from what I've seen, not all teams or companies use all parts of Kinesis. And there are four parts in Kinesis:

Ingest and process streaming data with Kinesis streams - Kinesis Data Streams
Deliver streaming data with Kinesis Firehose delivery streams - Kinesis Firehose Delivery Streams
Analyse streaming data with Kinesis analytics applications - Kinesis Analytics
Ingest and process media streams with Kinesis video streams - Kinesis Video Streams

All four parts offer something different. Well, the last two are definitely different from the first two. But it's the first two that I see a lot of people confusing. So I thought I'll

How To Generate Parquet Files in Java

Parquet is an open source file format by Apache for the Hadoop infrastructure. Well, it started as a file format for Hadoop, but it has since become very popular and even cloud service providers such as AWS have started supporting the file format.

Getting started with Chalice to create AWS Lambdas in Python – Step by Step Tutorial

If you’re into serverless stuff, you already know what AWS Lambda is. But if you don’t, AWS Lambda is a serverless service provided by Amazon where you can create ‘functions’ and deploy them in AWS, and run them without managing any server instances (such as EC2).

Proof of Concepts (POCs)

I write a lot of POC projects, especially when I'm learning something new, when I need to quickly test whether a data pipeline works, or when I'm just testing a new integration. I make all these POCs public as GitHub repositories. I wanted to consolidate the list of POCs in an easy-to-search fashion, and that's why I have this page here. Below is a list of all the POCs that I've written so far. If a particular POC has an accompanying blog post which explains its code, I have linked that blog post as well in the list below. Let me know if any of these POCs have helped you in any way.

Invoke an AWS Lambda Function from another Lambda Function

I recently discovered that you can't invoke more than one Lambda function in AWS for an S3 event with the same prefix and suffix (or just with the same suffix, which was the issue in my case). So I wanted a way to invoke one Lambda function from another Lambda function. If you're feeling kind of lost, check out the problem statement in my GitHub project, which should add some context. If you don't want to go there, I'll try to explain it here again.

The Problem and the Requirement

In one of our projects, we have a Lambda function which is invoked whenever a text file is uploaded to a particular S3 bucket. The Lambda function takes
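
As a minimal sketch of the general idea (not the code from the GitHub project), here is one way a first Lambda function could hand an S3 event on to a second one in Python. The function name "second-function" and the payload shape are hypothetical; the only AWS call used is the boto3 Lambda client's invoke(), and the client is passed in as an optional argument so the logic can be tried without AWS access.

```python
import json


def build_invoke_request(function_name, bucket, key):
    """Build the keyword arguments for a Lambda client's invoke() call."""
    payload = {"bucket": bucket, "key": key}  # hypothetical payload shape
    return {
        "FunctionName": function_name,
        # "Event" means asynchronous: the caller does not wait for a result.
        "InvocationType": "Event",
        "Payload": json.dumps(payload).encode("utf-8"),
    }


def handler(event, context, lambda_client=None):
    """Entry point of the first function, triggered by an S3 upload."""
    if lambda_client is None:
        # Imported lazily so the sketch can be exercised without boto3.
        import boto3
        lambda_client = boto3.client("lambda")

    # Pull the bucket and object key out of the first S3 event record.
    record = event["Records"][0]["s3"]
    request = build_invoke_request(
        "second-function",  # hypothetical name of the function to call
        record["bucket"]["name"],
        record["object"]["key"],
    )
    response = lambda_client.invoke(**request)
    return response["StatusCode"]
```

For this to work in AWS, the first function's execution role would need permission to invoke the second function.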
