Query data from S3 files using Amazon AthenaData ScienceTech by Sunny Srinidhi - September 24, 2019March 7, 20201 Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3. This is very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena is limited to data only from Amazon's own S3 storage service. However, Athena is able to query a variety of file formats, including, but not limited to CSV, Parquet, JSON, etc. In this post, we'll see how we can setup a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need
Use Apache Drill with Spring Boot or Java to query data using SQL queriesData ScienceTech by Sunny Srinidhi - September 24, 2019April 7, 20200 In the last few posts, we saw how to connect Apache Drill with MongoDB and also how we can connect it to Kafka to query data using simple SQL queries. But when you want to move this to an actual real world project, you can't sit around querying data from a terminal all day long. You want to write a piece of code which does the dirty work for you. But how exactly do you use Apache Drill within your code? Today, we'll see how we can achieve this with Spring Boot, or pretty much any other Java program. The Dependencies For this POC, I'm going to write a simple Spring Boot CommandLineRunner program. But you can use pretty much any other
Apache Drill vs. Apache Spark – Which SQL query engine is better for you?Data ScienceTech by Sunny Srinidhi - September 23, 2019February 13, 20200 If you are in the big data or data science or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a tiny bit of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. But the first thing that I wondered when I glanced over the capabilities of Apache Drill was, how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself and maybe for you guys too. It is very important to understand that
Analyse Kafka messages with SQL queries using Apache DrillData ScienceTech by Sunny Srinidhi - September 23, 2019January 13, 20201 In the previous post, we figured out how to connect MongoDB with Apache Drill and query data with SQL queries. In this post, let's extend that knowledge and see how we can use similar SQL queries to analyse our Kafka messages. Configuring the Kafka storage plugin in Apache Drill is quite simple, very similar to how we configured the MongoDB storage plugin. First, we run our local instances of Apache Drill, Apache Zookeeper, and Apache Kafka. After this, head over to http://localhost:8047/storage, where we can enable the Kafka plugin. You should see it in the list to the right of the page. Click the Enable button. The storage plugin will be enabled. After this, we need to add a few configuration
Getting Started with Apache Drill and MongoDBData ScienceTech by Sunny Srinidhi - September 23, 2019February 28, 20203 Not a lot of people have heard of Apache Drill. That is because Drill caters to very specific use cases, it's very niche. But when used, it can make significant differences to the way you interact with data. First, let's see what Apache Drill is, and then how we can connect our MongoDB data source to Drill and easily query data. What is Apache Drill? According to their website, Apache Drill is "Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage." That's pretty much self-explanatory. So, Drill is a tool to query Hadoop, MongoDB, and other NoSQL databases. You can write simple SQL queries that run on the data stored in other databases, and you get the result in a row-column format. The
Integrate AWS DynamoDB with Spring BootTech by Sunny Srinidhi - June 26, 2019March 12, 20200 Here is another POC to add to the growing list of POCs on my Github profile. Today, we’ll see how to integrate AWS DynamoDB with a Spring Boot application. This is going to be super simple, thanks to the AWS Java SDK and the Spring Data DynamoDB package. Let’s get started then. Dependencies First, as usual, we need to create a Spring Boot project, the dependencies of which look like: <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter</artifactId> </dependency> <dependency> <groupId>com.amazonaws</groupId> <artifactId>aws-java-sdk-dynamodb</artifactId> <version>1.11.573</version>
Apache Spark SQL User Defined Function (UDF) POC in JavaData ScienceTech by Sunny Srinidhi - May 14, 2019December 19, 20192 If you’ve worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it’s a feature where you define a function, pretty straight forward. But how is this different from any other custom function that you write? Well, when you’re working with Spark in a distributed environment, your code is distributed across the cluster. For this to happen, your code entities have to be serializable, including the various functions you call. When you want to manipulate columns in your Dataset, Spark provides a variety of built-in functions. But there are cases when you want a custom implementation to work with your columns. For this, Spark provides UDF. But you should be warned,
Connect Apache Spark with MongoDB database using the mongo-spark-connectorData ScienceTech by Sunny Srinidhi - April 3, 2019February 28, 20200 A couple of days back, we saw how we can connect Apache Spark to an Apache HBase database and query the data from a table using a catalog. Today, we’ll see how we can connect Apache Spark to a MongoDB database and get data directly into Spark from there. MongoDB provides us a plugin called the mongo-spark-connector, which will help us connect MongoDB and Spark without any drama at all. We just need to provide the MongoDB connection URI in the SparkConf object, and create a ReadConfig object specifying the collection name. It might sound complicated right now, but once you look at the code, you’ll understand how extremely easy this is. So, let’s look at an example. The Dataset Before we look
Connect Apache Spark to your HBase database (Spark-HBase Connector)Data ScienceTech by Sunny Srinidhi - April 1, 2019January 31, 20202 There will be times when you’ll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you’ll query the database, get the data in whatever format you fancy, and then load that into Spark, maybe using the `parallelize()`function. This works, just fine. But depending on the size of the data, this could cause delays. At least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in Hortonworks repository. Now, what is this connector and why should you be considering this? The Spark-HBase Connector (shc-core) The SHC is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pickup the
How you can improve your backend services’ performance using Apache KafkaTech by Sunny Srinidhi - November 27, 2018February 25, 20201 In most real world applications, we have a RESTful API service facing various client applications and a collection of backend services which process the data coming from those clients. Depending on the application, the architecture might have various services spread across multiple clusters of servers, and some form of queue or messaging service gluing them together. Today, we're going to talk about one such messaging service - Apache Kafka - and how it can improve the performance of your services. We're going to assume that we have at least two microservices, one for the APIs that are exposed to the world, and one which processes the requests coming in from the API microservice, but in an async fashion. Because this is