
Query data from S3 files using Amazon Athena


Amazon Athena is defined as “an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.” In other words, it’s another SQL query engine for large data sets stored in S3. In that sense it’s very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena can only query data stored in Amazon’s own S3 storage service. It can, however, query a variety of file formats, including CSV, Parquet, and JSON.


In this post, we’ll see how we can set up a table in Athena using a sample data set stored in S3 as a .csv file. But for that, we first need the sample CSV file. You can download it here:

Once you have the file downloaded, create a new bucket in AWS S3 and upload the file to it. I suggest creating a new bucket so that you can use it exclusively for trying out Athena, but any existing bucket will work as well.

So, now that you have the file in S3, open up Amazon Athena. You’ll see an option to create a table on the Athena home page. Mine looks similar to the screenshot below because I already have a few tables. As you can see from the screenshot, you have multiple options for creating a table. For this post, we’ll stick with the basics and select the “Create table from S3 bucket data” option.

Once you select that option, you’ll be redirected to a four-step process of creating a table. Let’s look at each of these steps briefly.

Step 1: Name & Location

As you can see from the screen above, in this step we define the database, the table name, and the S3 folder from which the data for this table will be sourced. If you already have a database, you can select it from the drop-down, like I’ve done here. If not, you have the option of creating a database right from this screen.
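
By the way, if you’d rather create the database from the query editor instead of this screen, a one-line statement does it. Here’s a minimal sketch, where sampledb is just an illustrative name:

-- Creates the database only if it doesn't already exist.
-- "sampledb" is a hypothetical name; use your own.
CREATE DATABASE IF NOT EXISTS sampledb;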

Next, provide a name for the table. For this example, I’ve named the table sampleData, just to keep it the same as the CSV file I’m using.

Next, you have to provide the path of the S3 folder where the file is stored. Note that you can’t provide a file path, only a folder path, so all the files in that folder with the matching file format will be used as the data source. Since we only have one file, our data will be limited to that. We’ll ignore the encryption option in this post.

Let’s also note here that Athena does not copy any data from these source files to another location, in memory or on disk. Every query runs against the original data set in S3.

Step 2: Data Format

This is a pretty straightforward step: you just select the file format of your data source. Because we’re using a CSV file, we’ll select CSV as the data format.

Step 3: Columns

In this third step, we define the “columns,” or the fields in each record of our data set. This is required so that Athena knows the schema of the data we’re working with. Any column that is not defined here, or is misconfigured (a typo in the name, for example), will be ignored or come back with empty values. So make sure you configure the columns properly.

In case your data set has too many columns and it becomes tedious to configure each of them individually, you can add columns in bulk as well. You’ll find the option for that at the bottom of the page. For example, the bulk definition for our data set looks like this:

_id string, string1 string, string2 string, double1 double, double2 double

As you can see, the format is pretty simple: the name of the column, followed by a space, followed by the type of data in that column. Column definitions are separated by commas.

Step 4: Partitions

This step, which deals with partitions, is a bit more advanced. Since our data set is pretty small, and partitioning is out of the scope of this particular post, we’ll skip this step for now. So ignore it, and confirm the rest of the configuration.

Your Athena table setup is now complete, and you’ll be taken to the query page. Here, you’ll see the CREATE TABLE query that was used to create the table we just configured. You don’t have to run this query; the table is already created and listed in the left pane. The query is displayed only for your reference. Next time, you can write a query like this manually instead of going through the three or four steps in the console.
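
For reference, the generated DDL for a table like ours looks roughly like the sketch below. Note that this is an illustration rather than the exact query the console produces, and the database name and S3 location are placeholders you’d replace with your own:

-- A sketch of the DDL Athena generates for a CSV-backed table.
-- "sampledb" and the s3:// location below are hypothetical placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.sampledata (
  `_id` string,
  `string1` string,
  `string2` string,
  `double1` double,
  `double2` double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://your-bucket-name/sample-data/'
TBLPROPERTIES ('has_encrypted_data' = 'false');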

What’s left now is to query the table and see if our configuration is correct. To test it out, we’ll run this simple SQL query:

select * from sampledata limit 10;

After running this query, your output should be similar to the following screenshot.
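
Incidentally, if you ever want to confirm exactly which S3 files a table is reading from, Athena exposes a "$path" pseudo-column that you can select like any other column:

-- Lists the S3 object(s) backing the table, which is handy for
-- verifying the folder-level sourcing we configured in Step 1.
select distinct "$path" from sampledata;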

We can even run aggregation queries on this data set. Here are a couple of examples:

select avg(double1) from sampledata;
select sum(double2) from sampledata;
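
We can also combine aggregation with grouping. Here’s a sketch that assumes string1 holds values that repeat across rows, which is an assumption about this particular sample data set:

-- Average of double1 for each distinct value of string1,
-- highest averages first. Assumes string1 values repeat.
select string1, avg(double1) as avg_double1
from sampledata
group by string1
order by avg_double1 desc
limit 10;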

That’s pretty much it. You can now run SQL queries on your file-based data sources in S3 using Athena. This comes in very handy when you have to analyse huge data sets stored as multiple files in S3. Depending on how your data is distributed across files and which file format you use, your queries can be very performant: you can query hundreds of GBs of data in S3 and get results back in just a few seconds. Even complex queries with multiple joins return pretty quickly. We’ll probably talk more about that some other day.

