Apache Hive is a complex system when you look at it, but once you go looking for more info, it’s more interesting than complex. There are multiple query engines available for Hive, and then there’s LLAP on top of the query engines to make real-time, interactive queries more workable. Live Long And Process, or LLAP as it’s more popularly known as, is an awesome concept and execution when you learn more about it.
And that’s exactly what I did, I read a few documents and other posts about LLAP and got a better understanding of it. I understood the Tez query engine too, and expect a post about it soon. In this, I’ll unpack whatever I learnt about LLAP and how it’s awesome if you use it.
Hive Query Engines
There are three query engines for Hive:
- MapReduce (MR)
MapReduce is the first query engine that shipped with Hive, and this is also the slowest of the bunch. When you submit a Hive query with MR as the query engine, every query gets converted into MapReduce jobs and gets submitted to YARN containers. YARN, or Yet Another Resource Negotiator, is common between MR and Tez query engines. But the problem with the MR query engine is that all queries need to be converted to MR jobs. The conversion itself takes time. So you can imagine how this query engine becomes slow with a lot of latency.
To overcome this latency issue, Tez was introduced as a query engine with later versions of Hive. Tez uses Directed Acyclic Graphs (DAGs) to process queries instead of MapReduce jobs. This greatly reduces latency and improves the query response times. WIth the latest version of Hive, even though MR query engine is deprecated, it is the default engine. But you get the deprecation warning whenever you enter the Hive shell and a recommendation to switch to either Tez or Spark as the query engine. And it is universally suggested to switch to Tez. We’ll see why in my next post.
Finally, we have Apache Spark as the third query engine option. And Spark, by far, is the fastest of them all. There are claims that Spark can improve the performance of Hive by as much as 100x, that’s a very bold claim. Tez doesn’t offer such high boost in performance, the much accepted boost is 10x. So you might say, well, Spark is the clear winner, right? Well, that depends! Tez and Spark both use DAGs to optimise the query performance of MR. So there can be a lot of parallel or concurrent execution of tasks in both.
The choice between Tez and Spark for a query engine totally comes down to your application and needs. But no matter what you choose, LLAP and sit on top of that query engine and better the performance even more. For example, we know that Spark can cache (persist) data in memory or on disk if we need that data again. But this caching is available within the same Spark job, another Spark job can’t access that data. What if I tell you that’s possible with LLAP? I know, that’s crazy. So, let’s see what else LLAP can do.
Live Long And Process (LLAP)
We’re finally going to talk about LLAP. The first you need to know, as I already made it clear, is that LLAP is not a query engine. It sits on top of the query engine to make query and data processing that much faster. If you image the various Hive components to be arranged as a stack, you have HDFS at the bottom, YARN on top of it, and then Hive itself at the top. Now imagine a layer of cache and in-memory processing on top of HDFS. This means a lot of requests don’t go to HDFS at all. That’s what, at a very high level, LLAP does.
I’m definitely over-simplifying stuff when I say LLAP is a caching and in-memory processing layer. There’s definitely a better way to put it. Let me elaborate.
You can think of LLAP as just another YARN application running on all data nodes in a Hadoop cluster. The only difference is that LLAP is a long-lived process (hence the name). But this doesn’t mean that it eats up all your resources. It can be configured to a very tiny process to process simple queries, or it can be configured to dynamically scale out and down whenever required. This brings in a very big difference compared to Spark. And because LLAP still work on YARN, it brings in all the advantages of YARN such as distributed nature and fault tolerance.
LLAP runs daemon processes on data nodes, and these daemons are not tied to a user who is issuing Hive queries. And this is a very important distinction, because this allows LLAP to reuse cached data across users. So you and I both fire similar queries on the same table, LLAP will be able to use the cache that’s already available for both our queries, and that’ll boost the performance for both of us. Without LLAP, both the queries would have to perform the same operations individually. And you can imagine how that’s not very optimal.
LLAP is not a query engine, and it’s optional
You should’ve realised by now, LLAP is completely optional. It’s not mandatory to use LLAP for Hive queries. You only use LLAP if you want to improve the responsiveness of Hive queries both in interactive and batch modes. Even when you use LLAP, it’s not like every and all parts of a query are executed within LLAP. LLAP is not meant for that, that’s what query engines are for. LLAP takes parts of the queries that can benefit by using the cache or the long lived processes.
LLAP doesn’t promise anywhere that it’ll execute the entire query itself. In fact, it’s actually the query engines that orchestrate what can go into LLAP and what can’t. And for now query engine Tez and other frameworks such as Pig can use LLAP in their stack. Unfortunately, support for MapReduce is not yet planned, and don’t hold your breath on this.
And because LLAP is still built to work with YARN, the resource allocation is completely in control of YARN. This is also the reason LLAP daemon nodes are able to talk to each other and share data across nodes. Another advantage here is that the daemon processes themselves don’t require a lot of resources to function. YARN allocates the minimum required for the processes themselves and will increase the resource allocation as and when required depending on the workload. And to avoid heap issues or JVM memory issues, cached data is always kept off-heap and in large buffers. So processing of aggregations such as group by and joins will be much faster in LLAP compered to query engines.
I already mentioned that LLAP usually executes parts of queries and not usually the complete query. These parts of queries are known as query fragments. These fragments include filters, data transformations, partial aggregations, projections, sorting, bucketing, joins, semi-joins, hash joins, etc. And it is to be noted that only certain “blessed” UDFs and Hive code are accepted into LLAP.
For stability of the process and security of the data, LLAP doesn’t localise any code, and executes on the fly. And because the daemon is not tied to any particular user (as already mentioned), an LLAP node can allow parallel execution of various query fragments, across queries and sessions. This is one of the primary reasons for improved performance. Another good news, for developers specifically, is that LLAP APIs are directly available via client SDKs. You can directly specify relational transformations using these APIs.
Input and Output
As I already mentioned, the daemons themselves have a very small footprint, and that’s because everything else is mostly done by offloading the work to multiple threads. Input and output, for example, are offloaded to threads. And transformations are done in separate threads. So as soon as an I/O thread makes the data ready, the data is passed on to separate threads for processing. This makes the I/O threads available for new I/O operations.
The data is passed further in the process to other threads in run-length encoded (RLE) columnar format. This reduces copying data across threads and processes to a great extent. And by extension, caching is also in the same RLE format. You can start seeing the benefits here.
I/O and caching depend heavily on the knowledge of the file format that the data is stored in. This is necessary if I/O and caching have to be performant. So LLAP has externalized this knowledge with the help of plugins. And to start with ORC is the first file format supported by LLAP. This is one of the reasons why there is an increase in the adoption of ORC as the preferred file format for external Hive tables.
When it comes to caching, both metadata and the data itself are cached. As I mentioned in previous sections, data is cached off-heap to avoid other potential issues. But metadata, on the other hand, is stored in-process as Java objects. This makes sure that even if the data itself is evicted from the cache, the metadata is still in memory to avoid some overhead.
I touched upon data eviction in cache. And as you might expect, there are various policies for data eviction. By default, LRFU policy is used. But you can plug in any other policy at any time.
This is one big area of debate in Hive, to transact or not to transact. But that’s a topic for a separate blog post. When it comes to LLAP, it understands transactions. And it is smart enough to perform transformations (such as merging or delta files) before the data is cached. If there are various transactions performed on the same tables, which is usually the case, LLAP can store multiple versions of data for each such variation. And the correct version will be fetched from the cache on the fly depending on the query. This will make sure that the same set of transformations are not being performed on the same data over and over again, thereby reducing a lot of processing time.
Understanding LLAP doesn’t stop here. There’s a lot more to it, and the more you try to understand it, the more interesting it becomes. I’m planning to write more about it as and when I explore more. But for now, this is all I have. Understanding how LLAP works will make it a lot easier to write queries, and also to write queries in a way that can make use of these optimisations to reduce latency. I hope this helped you a slight bit with your Hadoop or Hive journey.
Become a Patron!