Optimize multiple Hive QL in Oozie

Question

I am not familiar enough with Hive, so here I am. We use Oozie to chain a bunch of HiveQL jobs together. I was tasked with optimizing an application that is already running in our production environment. The business partners don't want it to take longer than about 1.5 hours. One of the first things I noticed is that there are around 90 Oozie actions in this one workflow. We also share a YARN queue with other applications. Half of those actions are hive2 actions, and each of them runs only one HQL statement. There sometimes seem to be delays between HiveQL actions because the Oozie launcher waits in a queue, and then the HiveQL waits in a queue. Is that normal? Are there ways around this?

For time-sensitive Hive queries:

1) Is Oozie the right tool for chaining time-sensitive HiveQL scripts together?
2) What are some alternatives? (Could using Java or Python to launch and handle the flow between HQL scripts have a performance benefit?)
3) Is there something we can do in HQL itself? (Again, I'm new to Hive; my experience is primarily with MapReduce/Spark and simple workflows of fewer than 20 actions.)
4) Are there any other performance considerations that I didn't mention?

Thank you,

Samson Scharfrichter · Accepted Answer

the Oozie launcher waits in a queue, and then the HiveQL waits in a queue.

Oozie does not run anything by itself. It starts a Launcher first -- a dummy YARN job (1 AppMaster + 1 Mapper) -- just to run the base command (Hive CLI fat client for "hive", Beeline thin client for "hive2", Pig CLI, Sqoop, Spark Driver, Bash shell, etc.). Then, that command may spawn a series of YARN jobs.

Note that YARN is not aware of the dependencies between the Launcher and the jobs it spawns. That is especially true for the "hive2" action, because the Launcher connects to HiveServer2 and it's HiveServer2 that spawns the jobs!

Advice #1 - the Launcher job requires very little coordination (just 1 Mapper, remember) so its AppMaster resources should be set rather low, to avoid consuming too much RAM and therefore blocking the queues. You can override the default settings with the (unfortunately not documented) action properties oozie.launcher.yarn.app.mapreduce.am.resource.mb (total RAM) and oozie.launcher.yarn.app.mapreduce.am.command-opts (explicit quota for Java heap size with the "-Xmx" parameter, typically 80% of RAM - too low and you get OutOfMemory errors, too high and YARN may kill your container for exceeding its quota).

Advice #2 - for "hive2" the Launcher job's Mapper requires very few resources too (Beeline is a thin JDBC client), so the same reasoning applies: lower its total RAM with oozie.launcher.mapreduce.map.memory.mb and its Java heap size with oozie.launcher.mapreduce.map.java.opts.
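As a rough sketch of Advice #1 and #2, the four launcher properties could sit in the configuration block of a "hive2" action like this. The action name, the ${...} parameters, the script name, the transition targets and the 512 MB / 400 MB figures are placeholders picked for illustration, not recommended values; size them against your own cluster.

```xml
<action name="run-query">
  <hive2 xmlns="uri:oozie:hive2-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Launcher AppMaster: total RAM, then Java heap at roughly 80% of it -->
      <property>
        <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
      </property>
      <property>
        <name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx400m</value>
      </property>
      <!-- Launcher Mapper (runs Beeline): same idea -->
      <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>512</value>
      </property>
      <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx400m</value>
      </property>
    </configuration>
    <jdbc-url>${jdbcUrl}</jdbc-url>
    <script>query.hql</script>
  </hive2>
  <ok to="next-step"/>
  <error to="fail"/>
</action>
```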

Advice #3 - if you can get access to a higher-priority YARN queue (as advised by Biswajit Nayak) then use it with oozie.launcher.mapreduce.job.queuename for the Launcher. For the actual Hive queries, it depends:

with "hive" only, you can also set mapreduce.job.queuename in Oozie action
with "hive" or "hive2", you can insert command set mapreduce.job.queuename = *** ; at the top of the HQL script
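A minimal sketch of Advice #3 for the older "hive" action, where both properties can be set in the action itself. The queue name high_priority, the action name, the script name and the transition targets are hypothetical; use whatever queue you have actually been granted.

```xml
<action name="run-query">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Queue for the Launcher job itself -->
      <property>
        <name>oozie.launcher.mapreduce.job.queuename</name>
        <value>high_priority</value>
      </property>
      <!-- Queue for the jobs spawned by the Hive CLI ("hive" action only) -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>high_priority</value>
      </property>
    </configuration>
    <script>query.hql</script>
  </hive>
  <ok to="next-step"/>
  <error to="fail"/>
</action>
```

With "hive2" only the first property would apply in the action; the queue for the query itself would have to come from the set command in the HQL script.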

Advice #4 - if the default AM resources seem to be oversized for your Hive queries, you can also try to resize them:

with "hive" only, you can set yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts in Oozie action - or possibly tez.am.resource.memory.mb and tez.am.launch.cmd-opts when using TEZ
with "hive" or "hive2", you can insert the equivalent set commands for the same properties at the top of the HQL script
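For the "hive" action variant of Advice #4, the configuration block might look like the fragment below. The 1024 MB / 800 MB figures are made-up examples, not tuned values.

```xml
<configuration>
  <!-- AppMaster of the MapReduce jobs spawned by the query: total RAM -->
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <!-- Java heap for that AppMaster, roughly 80% of the RAM above -->
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx800m</value>
  </property>
  <!-- When running on TEZ, tez.am.resource.memory.mb and
       tez.am.launch.cmd-opts would possibly take the place of the two above -->
</configuration>
```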

Caveat for #1-2-4: you cannot request less than yarn.scheduler.minimum-allocation-mb (and it's set for the ResourceManager service, so you can't override that one on a per-job basis).

Are there any other performance considerations

Advice #5 - if some steps can be chained in the same HQL script, it will reduce the overhead of Oozie polling YARN to detect the end of the first query, then starting another Launcher, then the Launcher starting another Hive session. Of course, in case of error, the execution control will not be as fine-grained, and some manual clean-up may be required before restart.

Advice #6 - if some steps can be done in parallel, and you actually have enough YARN resources to run them in parallel, then place them in different branches of an Oozie Fork/Join (as advised by Biswajit Nayak).
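A Fork/Join for Advice #6 could be sketched as follows, assuming two independent queries. All node names, script names and ${...} parameters are hypothetical. Each branch must end by transitioning to the same join node.

```xml
<fork name="parallel-queries">
  <path start="query-a"/>
  <path start="query-b"/>
</fork>

<action name="query-a">
  <hive2 xmlns="uri:oozie:hive2-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <jdbc-url>${jdbcUrl}</jdbc-url>
    <script>query_a.hql</script>
  </hive2>
  <ok to="queries-done"/>
  <error to="fail"/>
</action>

<action name="query-b">
  <hive2 xmlns="uri:oozie:hive2-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <jdbc-url>${jdbcUrl}</jdbc-url>
    <script>query_b.hql</script>
  </hive2>
  <ok to="queries-done"/>
  <error to="fail"/>
</action>

<!-- The workflow continues only after both branches have finished -->
<join name="queries-done" to="next-step"/>
```

Keep in mind that each branch still starts its own Launcher, so this only helps if the queue has room for them to run side by side.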

Advice #7 - if you don't already use TEZ, give it a try. Can be tricky to find a good set of parameters for your cluster, but when it works, it's way more efficient than MapReduce in many cases (i.e. it re-uses the same YARN containers for Map and Reduce steps, and even for successive queries - less YARN overhead, less intermediate disk I/O, etc.)

~~~~~~~~

By the way, do you see any good reason for using the older "hive" action in some places? Maybe there are options to force "local mode", i.e. skip YARN and run small queries inside the Launcher with no extra overhead? Or maybe they wanted the verbose logs?
