Apache Spark Optimisation Techniques

Data Science

by Sunny Srinidhi - February 23, 2023

Apache Spark is a popular big data processing tool. In this post, we look at a few techniques for optimising the performance of Spark jobs.

Optimising Hive Queries with Tez Query Engine

Data Science

by Sunny Srinidhi - June 13, 2022

Hive and Tez configuration can be fine-tuned to improve the performance of queries. Let’s look at a few such techniques.

Understanding Apache Hive LLAP

Data Science

by Sunny Srinidhi - November 18, 2021

In this post, I explain what LLAP is in Apache Hive and how it can help reduce query latency.

Installing Hadoop on the new M1 Pro and M1 Max MacBook Pro

Data Science

by Sunny Srinidhi - November 5, 2021

We’ll see how to install and configure Hadoop and its components on macOS running on Apple’s new M1 Pro and M1 Max chips.

Installing Hadoop on Windows 11 with WSL2

Data Science

by Sunny Srinidhi - November 1, 2021

We’ll see how to install and configure Hadoop and its components on Windows 11 running a Linux distro using WSL 1 or 2.

Installing Zsh and Oh-my-zsh on Windows 11 with WSL2

Tech

by Sunny Srinidhi - October 27, 2021

In this post, which is part of a series on setting up Windows 11 and WSL2 for big data work, I install Zsh and Oh-my-zsh and set up aliases.

Getting Started With Apache Airflow

Data Science

by Sunny Srinidhi - October 11, 2021

I recently started working with Apache Airflow. And as is tradition, I’m telling you everything about it here.

Fake (almost) everything with Faker

Data Science

by Sunny Srinidhi - September 30, 2021

Generating customer and address data for testing has never been easier. We’ll see how to do that using the Faker Python library.

Querying Hive Tables From a Spring Boot App

by Sunny Srinidhi - June 30, 2021

In this post, we’ll see how to connect to a Hive database and run queries on that database from a Spring Boot application.

Getting Started With JanusGraph

Data Science

by Sunny Srinidhi - February 25, 2021

JanusGraph is a graph processing tool that can query distributed graph data in milliseconds. In this post, we’ll see how to get started with it.