March 19, 2024


Optimising Hive Queries with Tez Query Engine

Hive and Tez configuration can be fine-tuned to improve the performance of queries. Let's look at a few such techniques.

Hive gives us the option of executing SQL queries with a few different query engines. It ships with the native MapReduce engine, but we can switch to Tez, which has gained popularity since its launch, or to Apache Spark. Most production deployments of Hive today use the Tez query engine.

In this post, we are going to look at a few Hive configuration parameters that we can use to tune the performance of our Hive queries. The default configuration is not always the best option; customising some of these parameters can sometimes improve query performance by as much as 50%. Let’s look at some of them.


Number of Reducers

By default, Tez determines how many reducers to use for a query based on the number of bytes of data processed. We can of course override this and provide a constant number, but it should be obvious that a constant number of reducers is not optimal for all queries. We would have to tune the number of reducers for each query by brute force, checking performance after every change, and as the amount of data changes, the best value changes as well.

Tez uses the following configuration parameter to control the number of reducers:

mapred.reduce.tasks

By default the value is set to -1, which means Tez controls the number of reducers dynamically, and this is also the recommended setting. Tez does not assign this number randomly, of course; there’s a formula for calculating the number of reducers. We could use that formula to set a value here ourselves, but we can’t guarantee that number is the best for all queries, or for the same query over time.
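If we do want to experiment with a fixed value, a quick session-level override looks like this (a sketch only; the value 32 below is purely illustrative):

-- force a fixed number of reducers for this session (illustrative value)
set mapred.reduce.tasks=32;

-- revert to letting Tez decide dynamically (the default)
set mapred.reduce.tasks=-1;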

The formula Tez uses is:

Max(1, Min(hive.exec.reducers.max, ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor
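As a rough worked example, assume a reducer-stage estimate of 10 GB and the commonly seen defaults of hive.exec.reducers.bytes.per.reducer = 256 MB, hive.exec.reducers.max = 1009 and hive.tez.max.partition.factor = 2.0 (these defaults vary between Hive versions, so treat the numbers as assumptions):

Max(1, Min(1009, 10 GB / 256 MB)) x 2.0
= Max(1, Min(1009, 40)) x 2.0
= 40 x 2.0
= 80 reducers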

Instead of using a constant number of reducers, we can enable cost-based optimisation and vectorisation, and compute statistics on a table, to let Tez generate a much better logical and physical execution plan. We’ll talk more about these later in this post.


Index the Filters

We use partitioning to physically partition the data in our Hive tables based on the values of one or more columns. This improves the performance of Hive queries because it reduces the amount of data that has to be processed. This is why it is always advised to partition tables and to use where conditions in queries that filter on these partitions. But did you know that you can index these filters to increase performance even further?

Hive provides the following configuration parameter to enable indexing on filters. The advantages here should be obvious.

set hive.optimize.index.filter=true; 
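As a minimal sketch, assume a hypothetical ORC-backed table web_logs partitioned by a dt column. The partition filter below prunes entire partitions, and with the flag above enabled Hive can also push the user_id predicate down to the ORC reader to skip row groups inside the files it does read:

-- hypothetical partitioned table
create table web_logs (user_id string, url string)
partitioned by (dt string)
stored as orc;

set hive.optimize.index.filter=true;

select count(*)
from web_logs
where dt = '2024-03-01' and user_id = 'u123';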

Use ‘Fetch Task’ for Simpler Queries

We know the first stage of a Hive query uses a mapper to map the data. But for some simple queries, such as select * or select * ... limit 10, there’s not much for a mapper to do. For such queries, Hive can fetch the data directly from the source and apply the limits or filters. For this, Hive provides the following configuration:

set hive.fetch.task.conversion=more;

Using this configuration, we can see such simple queries return within a second, whereas a MapReduce job for the same query could take minutes.
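For instance, with the hypothetical web_logs table from the previous section, the first two queries below can be served by a fetch task, while the group-by still needs a full Tez job:

set hive.fetch.task.conversion=more;

-- simple projections, filters and limits are fetched directly
select * from web_logs limit 10;
select user_id, url from web_logs where dt = '2024-03-01' limit 10;

-- an aggregation still launches a Tez job
select url, count(*) from web_logs group by url;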


Use Stats for Queries

As I already mentioned, we can compute statistics on a table to optimise the performance of some aggregation queries. If stats are available for a table, Tez uses them to adjust the logical and physical execution plan and optimise reads and computation. We can calculate the required stats using the following command:

analyze table <table_name> compute statistics;

Once we compute the stats for a table like this, we can use the following configuration to let queries use this new data to optimise the performance:

set hive.compute.query.using.stats=true;
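Column-level statistics can be gathered as well. Once the stats are fresh and the flag above is on, simple aggregates such as count(*), min and max can often be answered straight from the metastore without scanning the data. A sketch with the hypothetical web_logs table (partitioned by dt):

analyze table web_logs partition (dt) compute statistics;
analyze table web_logs partition (dt) compute statistics for columns;

set hive.compute.query.using.stats=true;

-- can be answered from the table stats without reading the files
select count(*) from web_logs;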

Enable Vectorisation

Doing anything in batches is comparatively faster than doing it one row at a time. Enabling vectorisation improves the performance of aggregations, scans, joins, filters, etc. by performing these operations on batches of 1024 rows instead of a single row. There are two configuration parameters for this:

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
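A small end-to-end sketch, again assuming the ORC-backed web_logs table from earlier (in older Hive versions vectorised execution only applies to ORC data):

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- the scan, filter and aggregation can now run on batches of 1024 rows
select dt, count(*)
from web_logs
where url like '%/checkout%'
group by dt;

Recent Hive versions also support explain vectorization on a query to check whether the optimisation was actually applied.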

Enable Cost Based Optimisation

Depending on the Hive version, Cost Based Optimisation (CBO) may be turned off by default. Turning it on tells Tez that each query’s logical and physical plan needs to be optimised for query cost before it is submitted for execution. Enabling CBO forces Tez to come up with a plan that reduces the query cost, which affects decisions such as the degree of parallelism, the type of join, how that join is applied, and much more.

To enable CBO, we first need to compute stats on the table, using the analyze table command we saw earlier. Once the stats are computed, we can set the following configuration parameters to enable CBO for our queries:

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
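Putting it all together, here is a minimal sketch for a join on the hypothetical web_logs and users tables (table and column names are just placeholders):

-- gather stats first, so the optimiser has something to cost with
analyze table web_logs partition (dt) compute statistics for columns;
analyze table users compute statistics for columns;

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- with column stats available, CBO can choose the join order and join
-- algorithm (for example a map join for the smaller side) based on cost
select u.country, count(*)
from web_logs w
join users u on w.user_id = u.user_id
where w.dt = '2024-03-01'
group by u.country;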

Summary

There’s obviously a lot more we can do to optimise the performance of Hive queries; maybe we’ll keep that for part two of this post. And of course, not all of these techniques will work for every use case. A/B test them on your own data to see how they affect performance, and promote the changes to production only if and when you see an improvement.

And if you like what you see here, or on my Medium blog and Dev.To blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and Github.
