If you are in the big data, data science, or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a small fraction of you might have actually worked with it. I discovered Apache Drill very recently, but since then, I’ve come to like what it has to offer. The first thing I wondered when I glanced over Apache Drill’s capabilities was: how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I’m going to answer these questions for myself, and maybe for you too.
It is very important to understand that there is a fundamental difference between the two in how they are implemented and what they are capable of.
With Apache Drill, we write SQL queries to fetch data from a variety of sources, such as SQL databases, MongoDB, AWS S3, Apache Kafka, JSON files, and many more. With Apache Spark, on the other hand, we write “SQL-like” queries to fetch data from various data sources. The difference here is that Drill is an ANSI SQL:2003 query engine, with more powerful features than a traditional SQL query engine, while Spark supports only a subset of SQL, and in a limited fashion. Spark is a generic computation engine that supports SQL-like queries to get the data it needs to run those computations on. Spark also has drivers for the various storage services you want to run your queries against.
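To make that concrete, here is a sketch of what a plain SQL query looks like in Drill when pointed at a raw JSON file, with no table definitions set up in advance. `dfs` is Drill’s filesystem storage plugin; the file path and field names are hypothetical:

```sql
-- Query a JSON file as if it were a table; Drill infers the schema.
SELECT t.name, t.address.city AS city
FROM dfs.`/data/customers.json` AS t
WHERE t.age >= 30;
```

The same ANSI SQL you would write against a relational database works here, which is the core of Drill’s appeal to people coming from BI tools.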
The list of data sources that SparkSQL supports is also quite long, but not as long as Drill’s. People who have been working with BI tools long enough find themselves at home with SQL queries. Bringing that knowledge over to Apache Drill will let them query data sources that might be new to them, such as NoSQL databases, Parquet files, S3 files, etc. Such people can run complex aggregation queries using Apache Drill, and most of the time they get the results they need. The same cannot be said of SparkSQL.
SparkSQL is ideal in cases where you want to process data inside Spark, and you want to fetch that data from external sources with some simple joins and WHERE conditions. Once you get the data into your Spark Datasets, you can then apply your transformations and perform your actions.
The way these two tools handle the underlying data is quite different. Apache Drill can take a complex query that spans multiple data sources and push the relevant parts of it down to the native drivers of those sources. It maps the data internally to JSON, and its ability to work with JSON makes Drill more flexible than Spark in this respect.
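As an illustration, a single Drill query can join records across two different storage plugins, and Drill pushes what it can down to each native driver. The plugin names (`mongo`, `s3`), paths, and fields in this sketch are assumed configuration, not defaults:

```sql
-- Join a MongoDB collection against a Parquet file on S3 in one query.
SELECT o.order_id, o.total, c.email
FROM mongo.shop.orders AS o
JOIN s3.`customers/2024.parquet` AS c
  ON o.customer_id = c.customer_id
WHERE o.total > 100;
```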
Drill also extends ANSI SQL with commands such as FLATTEN and KVGEN, which make it easier to handle both simple and complex JSON data. And since Drill works with JSON internally, it can discover schema on the fly.
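Roughly, FLATTEN explodes a JSON array into one row per element, and KVGEN turns a map with arbitrary keys into key/value pairs that FLATTEN can then unnest. A sketch, with a made-up file path and field names:

```sql
-- Given rows like {"username":"a","tags":["x","y"],"attrs":{"os":"linux","tz":"utc"}}:

-- One row per tag.
SELECT t.username, FLATTEN(t.tags) AS tag
FROM dfs.`/data/events.json` AS t;

-- One row per attribute key/value pair, whatever the keys happen to be.
SELECT s.username, s.kv.`key` AS attr_name, s.kv.`value` AS attr_value
FROM (
  SELECT t.username, FLATTEN(KVGEN(t.attrs)) AS kv
  FROM dfs.`/data/events.json` AS t
) AS s;
```

The FLATTEN(KVGEN(...)) combination is what lets you aggregate over JSON objects whose keys vary from record to record, without declaring a schema first.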
To conclude, if there’s a lot of math, data transformation, and machine learning you want to do, you’re better off sticking with Apache Spark. If that’s not the case and you only want to run some aggregation queries to analyse your data, you can go with Apache Drill.
What’s interesting is that Drill provides a JDBC driver, and Spark can use JDBC drivers within SparkSQL to fetch data. So if you want fine control over what data you bring into Spark, and you want to run complex transformations on that data, you can use Drill from within Spark to get the data in. This combination should be enough for most applications.
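One way to wire this up is to register Drill as a JDBC source in Spark SQL and then query it like any other table. This is a sketch: the view name, connection details, and `dbtable` value are assumptions, and Drill’s JDBC driver JAR must be on Spark’s classpath:

```sql
-- Register a Drill-backed table as a JDBC view in Spark.
CREATE TEMPORARY VIEW drill_orders
USING jdbc
OPTIONS (
  url 'jdbc:drill:drillbit=localhost:31010',
  driver 'org.apache.drill.jdbc.Driver',
  dbtable 'dfs.orders'
);

-- Spark now pulls only the rows it needs from Drill.
SELECT customer_id, SUM(total) AS revenue
FROM drill_orders
GROUP BY customer_id;
```

With this setup, Drill does the heavy lifting of reading and filtering across its data sources, and Spark takes over for the transformations and machine learning on top.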