
Explore your Amazon S3 data online using Filestash


Amazon’s S3, or Simple Storage Service, has become one of the most used cloud services today. We use it for all kinds of purposes, including data lakes, intermediary storage, and persistence layers for databases. I know people who use S3 as their personal online storage, as an alternative to services such as Google Drive and Dropbox.

How to build a simple data lake using Amazon Kinesis Data Firehose and Amazon S3


As the data generated from IoT devices, mobile devices, applications, etc. grows by the hour, creating a data lake to store all that data is becoming crucial for almost any application at scale. There are many tools and services that you could use to create a data lake.

Query data from S3 files using Amazon Athena


Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3. This is very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena is limited to data from Amazon's own S3 storage service. However, Athena is able to query a variety of file formats, including CSV, Parquet, and JSON. In this post, we'll see how we can set up a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need

Use Amazon CloudSearch to quickly search through data


Most applications today need search functionality that lets users find content easily and quickly. But building that search feature is not a small task. It often requires specialized knowledge and massive compute resources to search through large amounts of data quickly.

Cleaning and Normalizing Data Using AWS Glue DataBrew


A major part of any data pipeline is the cleaning of data. Depending on the project, cleaning data can mean many things. But in most cases, it means normalizing the data and bringing it into a format that is accepted within the project.

Getting Started With Apache Airflow


Apache Airflow is another awesome tool that I discovered just recently. Just a couple of months after discovering it, I can’t imagine not using it now. It’s reliable, configurable, and dynamic. Because it’s all driven by code, you can version control it too.

Kinesis Data Streams vs. Kinesis Firehose Delivery Streams


I have talked about Kinesis before, and I'm sure you've been using Kinesis for longer than me. But from what I've seen, not all teams or companies use all parts of Kinesis. And there are four parts in Kinesis:

Ingest and process streaming data with Kinesis streams - Kinesis Data Streams
Deliver streaming data with Kinesis Firehose delivery streams - Kinesis Firehose Delivery Streams
Analyse streaming data with Kinesis analytics applications - Kinesis Analytics
Ingest and process media streams with Kinesis video streams - Kinesis Video Streams

All four parts offer something different. Well, the last two are definitely different from the first two. But it's the first two that I see a lot of people getting confused by. So I thought I'll

How To Generate Parquet Files in Java


Parquet is an open-source file format by Apache for the Hadoop infrastructure. Well, it started as a file format for Hadoop, but it has since become very popular, and even cloud service providers such as AWS have started supporting the file format.

Getting started with Chalice to create AWS Lambdas in Python – Step by Step Tutorial

If you’re into serverless stuff, you already know what AWS Lambda is. But if you don’t, AWS Lambda is a serverless service provided by Amazon where you can create ‘functions’ and deploy them in AWS, which you can then run without managing any server instances (such as EC2).

Proof of Concepts (POCs)

I write a lot of POC projects, especially when I'm learning something new, when I need to quickly test whether a data pipeline works, or when I'm just testing a new integration. I make all these POCs public as GitHub repositories. I wanted to consolidate the list of POCs in an easy-to-search fashion, and that's why I have this page here. Below is a list of all the POCs that I've written so far. If a particular POC has an accompanying blog post that explains its code, I have linked that blog post in the list below as well. Let me know if any of these POCs have helped you in any way.