Cleaning and Normalizing Data Using AWS Glue DataBrew
Data Science by Sunny Srinidhi - January 17, 2022
A major part of any data pipeline is the cleaning of data. Depending on the project, cleaning data could mean a lot of things. But in most cases, it means normalizing data and bringing it into a format that is accepted within the project.

Understanding Apache Hive LLAP
Data Science by Sunny Srinidhi - November 18, 2021
Apache Hive looks like a complex system at first, but once you go looking for more info, it's more interesting than complex. There are multiple query engines available for Hive, and then there's LLAP on top of the query engines to make real-time, interactive queries more workable.

Installing Hadoop on the new M1 Pro and M1 Max MacBook Pro
Data Science by Sunny Srinidhi - November 5, 2021
In the previous series of posts, I wrote about how to install the complete Hadoop stack on Windows 11 using WSL 2. Now that the new MacBook Pro laptops are available with the brand new M1 Pro and M1 Max SoCs, here's a guide on how to install the same Hadoop stack on these laptops.

Installing Hadoop on Windows 11 with WSL2
Data Science by Sunny Srinidhi - November 1, 2021
In the previous post, we saw how to install a Linux distro on Windows 11 using WSL2 and then how to install Zsh and Oh My Zsh to make the terminal more customizable. In this post, we'll see how we can install the complete Hadoop environment on the same Windows 11 machine using WSL.

Getting Started With Apache Airflow
Data Science by Sunny Srinidhi - October 11, 2021
Apache Airflow is another awesome tool that I discovered just recently. Just a couple of months after discovering it, I can't imagine not using it now. It's reliable, configurable, and dynamic. Because it's all driven by code, you can version control it too.

Fake (almost) everything with Faker
Data Science by Sunny Srinidhi - September 30, 2021
I was recently tasked with creating some random customer data, with names, phone numbers, addresses, and the usual other stuff. At first, I thought I'd just generate random strings and numbers (some gibberish) and call it a day. But then I remembered my colleagues using a package for that.

Querying Hive Tables From a Spring Boot App
Data Science, Tech by Sunny Srinidhi - June 30, 2021
In this post, we'll see how we can query tables that reside in Hive using a Spring Boot application. As always, I'm going to use a Spring Boot web app with a few GET APIs to show how we can query data from Hive.

out() vs. outE() – JanusGraph and Gremlin
Data Science by Sunny Srinidhi - March 3, 2021
If you are new to JanusGraph and the Gremlin query language, like I am, you may be confused about the out(), outE(), in(), and inE() methods. Even if you look at examples of these functions, the difference is not easy to see.

Getting Started With JanusGraph
Data Science by Sunny Srinidhi - February 25, 2021
JanusGraph is a graph processing tool that can process graphs stored on clusters with multiple nodes. JanusGraph is designed for massive clusters and for real-time traversals and analytics queries. In this post, we'll look at a few queries that you would want to run the very first time you install JanusGraph and start playing with the Gremlin console.

Kinesis Data Streams vs. Kinesis Firehose Delivery Streams
Data Science by Sunny Srinidhi - May 25, 2020
I have talked about Kinesis before, and I'm sure you've been using Kinesis for longer than I have. But from what I've seen, not all teams or companies use all parts of Kinesis. And there are four parts in Kinesis:

- Kinesis Data Streams: ingest and process streaming data with Kinesis streams
- Kinesis Firehose Delivery Streams: deliver streaming data with Kinesis Firehose delivery streams
- Kinesis Analytics: analyse streaming data with Kinesis analytics applications
- Kinesis Video Streams: ingest and process media streams with Kinesis video streams

All four parts offer something different. Well, the last two are definitely different from the first two. But it's the first two that I see a lot of people getting confused with. So I thought I'll