Home>>Data Science>>Getting Started with Apache Drill and MongoDB
Data ScienceTech

Getting Started with Apache Drill and MongoDB

Not a lot of people have heard of Apache Drill. That is because Drill caters to very specific use cases, it’s very niche. But when used, it can make significant differences to the way you interact with data. First, let’s see what Apache Drill is, and then how we can connect our MongoDB data source to Drill and easily query data.

What is Apache Drill?

According to their website, Apache Drill is “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.” That’s pretty much self-explanatory. So, Drill is a tool to query Hadoop, MongoDB, and other NoSQL databases. You can write simple SQL queries that run on the data stored in other databases, and you get the result in a row-column format. The best part is you can even query Apache Kafka and AWS S3 data with this.

Let’s see how we can connect to MongoDB and query some data.

Downloading Apache Drill

To test out Apache Drill on your local machine, you only need to download the compressed file, extract it, and run a shell script. And you’re ready to go. You can download Apache Drill from here.

Connect Apache Drill to MongoDB

Once you extract the file you just downloaded, cd into the bin directory and run the following command to run the embedded Drill service:

./drill-embedded

You’ll get a prompt similar to the following:

apache drill>

You’re now running Apache Drill on your local machine. Now let’s connect it to our local MongoDB instance. Open your internet browser and go to http://localhost:8047. This is the Apache Drill web UI. From here, you can configure storage (connections), run SQL queries, and change configurations.

For now, head over to the Storage tab. Here, you’ll find two lists. The first list to the left contains the storage plugins, or connections, that are enabled. To the right, you’ll see plugins which are disabled. You’ll see MongoDB to the right, in the disabled list. Enable the MongoDB plugin and it’ll be added to the left list. The page should look something like this:

You can edit the MongoDB connection string from here if you are not using the default port or want to connect to a remote MongoDB instance. After enabling this plugin, we’ll switch back to the command line interface to run some sample queries.

Before we begin, if you haven’t got any sample data to work with, you can easily generate thousands of records using a Java tool that I wrote. The code is here on Github.

Running Queries in the Apache Drill shell

Before we start querying MongoDB with Drill, let’s take a look at our sample dataset. I have generated 20,000 documents with random values, a subset of that data is here:

{
   "_id": "5d8860652182cc411d5b0335",
   "string1": "GQRTUGDeWL",
   "string2": "QALDs5gU0BI3pbPJaQB1",
   "double1": 95.15978134497485,
   "double2": 149.3989427846131
 },
 {
   "_id": "5d8860652182cc411d5b0336",
   "string1": "o7jpgRBDis",
   "string2": "nUfLFnYrPn3kM2gX7Tpx",
   "double1": 59.50691775921538,
   "double2": 121.2028465590032
 },
 {
   "_id": "5d8860652182cc411d5b0337",
   "string1": "tmDNupDis4",
   "string2": "plFBuQFKcjsfOsrYghX5",
   "double1": 72.08319169219718,
   "double2": 353.94233296797006
 },
 {
   "_id": "5d8860652182cc411d5b0338",
   "string1": "tb3pmNzjR0",
   "string2": "UxVAnBfl7Ae3Wuk45muy",
   "double1": 6.925084784225322,
   "double2": 422.1762689036945
 },
 {
   "_id": "5d8860652182cc411d5b0339",
   "string1": "ZQsKvrKmCD",
   "string2": "FCkhOsXZHIPKWxuVyMEy",
   "double1": 50.42262251682428,
   "double2": 624.9267615790717
 },
 {
   "_id": "5d8860652182cc411d5b033a",
   "string1": "qaFGn8Y3JZ",
   "string2": "QGNBG0G5ZGsWqiVUfjeb",
   "double1": 79.24223653522965,
   "double2": 891.9847982687945
 },
 {
   "_id": "5d8860652182cc411d5b033b",
   "string1": "6DMhppDzpI",
   "string2": "P5kSHZHv99U6E9YER9q6",
   "double1": 73.49960928692676,
   "double2": 816.4658357031391
 },
 {
   "_id": "5d8860652182cc411d5b033c",
   "string1": "uI4Mvq6rG6",
   "string2": "elRK44zAP8BPvnyON78v",
   "double1": 50.182529571564665,
   "double2": 813.6155673555375
 },
 {
   "_id": "5d8860652182cc411d5b033d",
   "string1": "NBWDHVoLeJ",
   "string2": "g3bJAEidJJWyT6EeaoEP",
   "double1": 21.82373793200345,
   "double2": 801.8655464012288
 },
 {
   "_id": "5d8860652182cc411d5b033e",
   "string1": "GVa48x6nzY",
   "string2": "ebhsF5OyxcTq5955Wmuf",
   "double1": 81.42379198684046,
   "double2": 911.1387851805042
 }

As you can see, the data makes no sense.

Now, getting the same dataset from MongoDB using Apache Drill is as simple as the most basic SQL query:

select * from sampleData;

But before this, let’s look at a few other queries.

Run the following query to list all the databases that you have access to. Note that you are not limited to only MongoDB databases, but databases from all the storage plugins that are currently enabled.

Listing all databases.

I have my sample data loaded into a collection in the “mongo.dill” database. So let’s use that:

Switching database.

Let’s now see what tables we have in this database:

Now, we run the select query to list our data:

That’s pretty much it. You can now start experimenting with much complex queries with the data you have.

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: