Data ScienceTech

Apache Drill vs. Apache Spark – Which SQL query engine is better for you?

If you are in the big data or data science or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a tiny bit of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. But the first thing that I wondered when I glanced over the capabilities of Apache Drill was, how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself and maybe for you guys too. It is very important to understand that there is a fundamental difference between the two, how they are implemented, and what they are capable of. With Apache Drill, we write SQL quer...

Read More
Data ScienceTech

Apache Spark SQL User Defined Function (UDF) POC in Java

If you’ve worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it’s a feature where you define a function, pretty straight forward. But how is this different from any other custom function that you write? Well, when you’re working with Spark in a distributed environment, your code is distributed across the cluster. For this to happen, your code entities have to be serializable, including the various functions you call. When you want to manipulate columns in your Dataset, Spark provides a variety of built-in functions. But there are cases when you want a custom implementation to work with your columns. For this, Spark provides UDF. But you should be warned, UDFs should be used as sparingly as possible. This is because ...

Read More