Apache Spark Optimisation Techniques


Apache Spark is a popular big data processing tool. In this post, we look at a few techniques for optimising the performance of Spark jobs.

Apache Drill vs. Apache Spark – Which SQL query engine is better for you?


If you work in big data, data science, or BI, you have probably heard of Apache Spark. A few of you might have also heard of Apache Drill, and a handful of you might have actually worked with it. I discovered Apache Drill very recently, and since then I've come to like what it has to offer. But the first thing I wondered when I glanced over its capabilities was: how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself, and maybe for you too. It is very important to understand that

Apache Spark SQL User Defined Function (UDF) POC in Java


If you’ve worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it’s a feature that lets you define your own function, which is pretty straightforward. But how is this different from any other custom function that you write? Well, when you’re working with Spark in a distributed environment, your code is distributed across the cluster. For this to happen, your code entities have to be serializable, including the various functions you call. When you want to manipulate columns in your Dataset, Spark provides a variety of built-in functions. But there are cases when you want a custom implementation to work with your columns. For this, Spark provides UDFs. But you should be warned,

Connect Apache Spark with MongoDB database using the mongo-spark-connector


A couple of days back, we saw how to connect Apache Spark to an Apache HBase database and query the data from a table using a catalog. Today, we’ll see how to connect Apache Spark to a MongoDB database and get data directly into Spark from there. MongoDB provides a plugin called the mongo-spark-connector, which helps us connect MongoDB and Spark without any drama at all. We just need to provide the MongoDB connection URI in the SparkConf object, and create a ReadConfig object specifying the collection name. It might sound complicated right now, but once you look at the code, you’ll see how easy this is. So, let’s look at an example. The Dataset Before we look

Connect Apache Spark to your HBase database (Spark-HBase Connector)


There will be times when you’ll need to bring the data in your HBase database into Apache Spark for processing. Usually, you’ll query the database, get the data in whatever format you fancy, and then load that into Spark, maybe using the `parallelize()` function. This works just fine. But depending on the size of the data, this could cause delays. At least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in the Hortonworks repository. Now, what is this connector and why should you consider it? The Spark-HBase Connector (shc-core) The SHC is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pick up the

Real-Time Data Processing: Understanding the What, Why, Where, Who, and How


In today’s data-driven world, businesses and organizations are continuously generating massive amounts of data. While processing data in batch mode remains useful, the need for instant decision-making has led to an increasing focus on real-time data processing. This article delves into what real-time data processing is, why it's essential, its various applications, the tools used to achieve it, trends shaping its evolution, and real-world use cases. What is Real-Time Data Processing? Real-time data processing refers to the capability to continuously ingest, process, and output data as soon as it is generated, with minimal latency. Unlike batch processing, which collects and processes data in large groups at set intervals (e.g., daily or hourly), real-time processing works with data immediately as it becomes available,

Optimising Hive Queries with Tez Query Engine


Hive and Tez configuration can be fine-tuned to improve the performance of queries. Let’s look at a few such techniques.

Understanding Apache Hive LLAP


In this post, I try to explain what LLAP is in Apache Hive and how it can help reduce query latency.
