Apache Spark Optimisation Techniques Data Science by Sunny Srinidhi - February 23, 2023
Apache Spark is one of the most popular big data processing tools today. It's used extensively on datasets small and large. The availability of Spark in more than one programming language makes it a favourite tool for data engineers and data scientists coming from various backgrounds.
Apache Drill vs. Apache Spark – Which SQL query engine is better for you? Data Science Tech by Sunny Srinidhi - September 23, 2019 (updated February 13, 2020)
If you are in the big data, data science, or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a handful of you might have actually worked with it. I discovered Apache Drill very recently, but since then, I've come to like what it has to offer. The first thing I wondered when I glanced over the capabilities of Apache Drill was: how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself, and maybe for you too. It is very important to understand that…
Apache Spark SQL User Defined Function (UDF) POC in Java Data Science Tech by Sunny Srinidhi - May 14, 2019 (updated December 19, 2019)
If you've worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it's a feature where you define your own function, pretty straightforward. But how is this different from any other custom function you write? Well, when you're working with Spark in a distributed environment, your code is distributed across the cluster. For this to happen, your code entities have to be serializable, including the various functions you call. When you want to manipulate columns in your Dataset, Spark provides a variety of built-in functions. But there are cases when you want a custom implementation to work with your columns. For this, Spark provides UDFs. But you should be warned…
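The serialization constraint mentioned in this excerpt can be seen without a Spark cluster at all. A minimal plain-Python sketch using the standard library's pickle (Spark actually uses its own serializers, so this is only an analogy): a module-level function serializes fine, while stdlib pickle refuses a lambda, which is the same class of problem you hit when shipping non-serializable code to executors.

```python
import pickle

def double(x):
    """A module-level function: picklable by reference to its name."""
    return x * 2

# Round-trip the function through bytes, as a cluster would ship code.
blob = pickle.dumps(double)
restored = pickle.loads(blob)
assert restored(21) == 42

# A lambda has no importable name, so stdlib pickle rejects it.
f = lambda x: x * 2
try:
    pickle.dumps(f)
except (pickle.PicklingError, AttributeError) as e:
    print("lambda not picklable:", e)
```

Spark works around this for Python and Scala closures with richer serializers, but the rule of thumb from the post stands: everything your UDF touches must be serializable.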
Connect Apache Spark with MongoDB database using the mongo-spark-connector Data Science Tech by Sunny Srinidhi - April 3, 2019 (updated February 28, 2020)
A couple of days back, we saw how we can connect Apache Spark to an Apache HBase database and query the data from a table using a catalog. Today, we'll see how we can connect Apache Spark to a MongoDB database and get data directly into Spark from there. MongoDB provides a plugin called the mongo-spark-connector, which helps us connect MongoDB and Spark without any drama at all. We just need to provide the MongoDB connection URI in the SparkConf object and create a ReadConfig object specifying the collection name. It might sound complicated right now, but once you look at the code, you'll understand how easy it is. So, let's look at an example…
Connect Apache Spark to your HBase database (Spark-HBase Connector) Data Science Tech by Sunny Srinidhi - April 1, 2019 (updated January 31, 2020)
There will be times when you'll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you'll query the database, get the data in whatever format you fancy, and then load that into Spark, maybe using the `parallelize()` function. This works just fine. But depending on the size of the data, this could cause delays; at least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in the Hortonworks repository. Now, what is this connector, and why should you be considering it? The Spark-HBase Connector (shc-core) is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pick up the…
Optimising Hive Queries with Tez Query Engine Data Science by Sunny Srinidhi - June 13, 2022
Hive gives us the option of executing SQL queries with a few different query engines. It ships with the native MapReduce engine, but we can switch to Tez, which has gained popularity since its launch, or we can use Apache Spark instead.
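The engine switch described here is a session-level Hive setting rather than new code; a minimal sketch, assuming a Hive session on a cluster with Tez installed (the table name is hypothetical):

```sql
-- Session-level switch of Hive's execution engine.
-- Valid values depend on your build: mr (native MapReduce), tez, spark.
SET hive.execution.engine=tez;

-- Queries in this session now compile to Tez DAGs instead of MapReduce jobs.
SELECT category, COUNT(*) FROM products GROUP BY category;  -- hypothetical table
```

The same property can be set cluster-wide in hive-site.xml if you want Tez to be the default for everyone.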
The Dunning-Kruger Effect In Tech Tech by Sunny Srinidhi - November 28, 2021 (updated December 18, 2021)
This is not the kind of post I usually write on my blog; it's more of a psychology lecture than a how-to tech tutorial. But it's not completely irrelevant either, because I'm going to talk about my experience with the Dunning-Kruger effect in tech over the last decade.
Understanding Apache Hive LLAP Data Science by Sunny Srinidhi - November 18, 2021
Apache Hive looks like a complex system at first, but once you go digging for more info, it's more interesting than complex. There are multiple query engines available for Hive, and then there's LLAP on top of them to make real-time, interactive queries more workable.
Getting Started With Apache Airflow Data Science by Sunny Srinidhi - October 11, 2021
Apache Airflow is another awesome tool that I discovered only recently. Just a couple of months after discovering it, I can't imagine not using it. It's reliable, configurable, and dynamic, and because it's all driven by code, you can version control it too.
Redundancy in a distributed system Tech by Sunny Srinidhi - April 13, 2020
A lot of engineers, system designers, and architects overlook redundancy, at least in my experience. Sometimes people ignore it because the system or the product is still in its early stages, so there's not a lot happening.
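The redundancy idea in this excerpt can be made concrete with a toy failover loop: if one replica of a service is down, the request succeeds anyway via another. Everything here (the Replica class and call_with_failover helper) is an illustrative sketch, not code from the original post.

```python
class Replica:
    """A hypothetical service replica that may be down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request!r}"


def call_with_failover(replicas, request):
    """Try each replica in turn; with redundancy, one failure
    does not fail the whole request."""
    errors = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas down: " + "; ".join(errors))


replicas = [Replica("replica-1", healthy=False), Replica("replica-2")]
print(call_with_failover(replicas, "GET /users/42"))
# prints: replica-2 handled 'GET /users/42'
```

Real systems layer retries, health checks, and load balancing on top of this, but the core insurance policy is the same: more than one copy of anything that can fail.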