Query data from S3 files using Amazon AthenaData ScienceTech by Sunny Srinidhi - September 24, 2019March 7, 20201 Amazon Athena is defined as "an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL." So, it's another SQL query engine for large data sets stored in S3. This is very similar to other SQL query engines, such as Apache Drill. But unlike Apache Drill, Athena is limited to data only from Amazon's own S3 storage service. However, Athena is able to query a variety of file formats, including, but not limited to CSV, Parquet, JSON, etc. In this post, we'll see how we can setup a table in Athena using a sample data set stored in S3 as a .csv file. But for this, we first need
Apache Drill vs. Apache Spark – Which SQL query engine is better for you?Data ScienceTech by Sunny Srinidhi - September 23, 2019February 13, 20200 If you are in the big data or data science or BI space, you might have heard about Apache Spark. A few of you might have also heard about Apache Drill, and a tiny bit of you might have actually worked with it. I discovered Apache Drill very recently. But since then, I've come to like what it has to offer. But the first thing that I wondered when I glanced over the capabilities of Apache Drill was, how is this different from Apache Spark? Can I use the two interchangeably? I did some research and found the answers. Here, I'm going to answer these questions for myself and maybe for you guys too. It is very important to understand that
Getting Started with Apache Drill and MongoDBData ScienceTech by Sunny Srinidhi - September 23, 2019February 28, 20203 Not a lot of people have heard of Apache Drill. That is because Drill caters to very specific use cases, it's very niche. But when used, it can make significant differences to the way you interact with data. First, let's see what Apache Drill is, and then how we can connect our MongoDB data source to Drill and easily query data. What is Apache Drill? According to their website, Apache Drill is "Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage." That's pretty much self-explanatory. So, Drill is a tool to query Hadoop, MongoDB, and other NoSQL databases. You can write simple SQL queries that run on the data stored in other databases, and you get the result in a row-column format. The
Apache Spark SQL User Defined Function (UDF) POC in JavaData ScienceTech by Sunny Srinidhi - May 14, 2019December 19, 20192 If you’ve worked with Spark SQL, you might have come across the concept of User Defined Functions (UDFs). As the name suggests, it’s a feature where you define a function, pretty straight forward. But how is this different from any other custom function that you write? Well, when you’re working with Spark in a distributed environment, your code is distributed across the cluster. For this to happen, your code entities have to be serializable, including the various functions you call. When you want to manipulate columns in your Dataset, Spark provides a variety of built-in functions. But there are cases when you want a custom implementation to work with your columns. For this, Spark provides UDF. But you should be warned,
Connect Apache Spark to your HBase database (Spark-HBase Connector)Data ScienceTech by Sunny Srinidhi - April 1, 2019January 31, 20202 There will be times when you’ll need the data in your HBase database to be brought into Apache Spark for processing. Usually, you’ll query the database, get the data in whatever format you fancy, and then load that into Spark, maybe using the `parallelize()`function. This works, just fine. But depending on the size of the data, this could cause delays. At least it did for our application. So after some research, we stumbled upon a Spark-HBase connector in Hortonworks repository. Now, what is this connector and why should you be considering this? The Spark-HBase Connector (shc-core) The SHC is a tool provided by Hortonworks to connect your HBase database to Apache Spark so that you can tell your Spark context to pickup the
What is multicollinearity?Data Science by Sunny Srinidhi - August 8, 2018January 30, 20200 Multicollinearity is a term we often come across when we’re working with multiple regression models. But do we actually know what it means?
Overfitting and Underfitting models in Machine LearningData Science by Sunny Srinidhi - August 2, 20180 In most of our posts about machine learning, we've talked about overfitting and underfitting. But most of us don't yet know what those two terms mean. What does it acutally mean when a model is overfit, or underfit? Why are they considered not good? And how do they affect the accuracy of our model's predictions? These are some of the basic, but important questions we need to ask and get answers to. So let's discuss these two today. The datasets we use for training and testing our models play a huge role in the efficiency of our models. Its equally important to understand the data we're working with. The quantity and the quality of the data also matter, obviously. When the data
Different types of Validations in Machine Learning (Cross Validation)Data Science by Sunny Srinidhi - August 1, 20180 Now that we know what is feature selection and how to do it, let's move our focus to validating the efficiency of our model. This is known as validation or cross validation, depending on what kind of validation method you're using. But before that, let's try to understand why we need to validate our models. Validation, or Evaluation of Residuals Once you are done with fitting your model to you training data, and you've also tested it with your test data, you can't just assume that its going to work well on data that it has not seen before. In other words, you can't be sure that the model will have the desired accuracy and variance in your production environment. You need
Different methods of feature selectionData Science by Sunny Srinidhi - July 31, 2018November 6, 20191 In our previous post, we discussed what is feature selection and why we need feature selection. In this post, we're going to look at the different methods used in feature selection. There are three main classification of feature selection methods - Filter Methods, Wrapper Methods, and Embedded Methods. We'll look at all of them individually. Filter Methods Filter methods are learning-algorithm-agnostic, which means they can be employed no matter which learning algorithm you're using. They're generally used as data pre-processors. In filter methods, each individual feature in the dataset will be scored on its correlation with the dependent variable. A variety of statistical tests will be used to calculate this correlation score. Based on this score, it will be decided whether to
What is Feature Selection and why do we need it in Machine Learning?Data Science by Sunny Srinidhi - July 31, 2018November 11, 20192 If you've come across a dataset in your machine learning endeavors which has more than one feature, you'd have also heard of a concept called Feature Selection. Today, we're going to find out what it is and why we need it. When a dataset has too many features, it would not be ideal to include all of them in our machine learning model. Some features may be irrelevant for the independent variable. For example, if you are going to predict how much it would cost to crush a car, and the features you're given are: the dimensions of the car if the car will be delivered to the crusher or the company has to go pick it up if the car