David

Reputation: 18281

Hive, Hadoop, and the mechanics behind hive.exec.reducers.max

In the context of this other question here:

Using the hive.exec.reducers.max directive has truly baffled me.

From my perspective, I thought Hive worked on logic something like: I have N blocks to read for a given query, so I need N maps, and from N I need some sensible number of reducers R, anywhere from R = N / 2 down to R = 1. For the Hive report I was working on there were 1200+ maps, and without any influence Hive planned for about 400 reducers, which would have been fine except that I was working on a cluster with only 70 reducer slots in total. Even with the fair job scheduler, this caused a backlog that would hang up other jobs. So I tried a lot of different experiments until I found hive.exec.reducers.max and set it to something like 60.
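For reference, a minimal sketch of how that cap can be applied at the session level, assuming a Hive CLI/Beeline session; the query shown is a placeholder, since the actual report query isn't in the post:

    -- Cap Hive at roughly the cluster's reducer capacity (value 60 as in the question).
    SET hive.exec.reducers.max=60;

    -- Placeholder query standing in for the actual report.
    SELECT dt, COUNT(*)
    FROM some_large_table
    GROUP BY dt;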

The result was that a Hive job that had taken 248 minutes finished in 155 minutes with no change in the results. What bothers me is: why doesn't Hive default to N never being greater than the cluster's reducer capacity? And seeing as I can roll over several terabytes of data with a smaller set of reducers than what Hive thinks is correct, is it better to always try to tweak this count?

Upvotes: 3

Views: 2946

Answers (2)

Jeremy Carroll

Reputation: 188

In my experience with Hive, setting mapred.job.reuse.jvm.num.tasks to a healthy number (in my case, 8) helps a lot with these ad-hoc queries. It takes around 20-30 seconds to spawn a JVM, so reuse can help out quite a bit with mappers and reducers that are short-lived (< 30 seconds).
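A minimal sketch of how this might be set for a single Hive session; the property and the value 8 come from the answer above, while combining it with the reducer cap from the question is just an illustration:

    -- Reuse each task JVM for up to 8 tasks instead of forking a new JVM per task.
    SET mapred.job.reuse.jvm.num.tasks=8;
    -- Can be combined with the reducer cap discussed in the question.
    SET hive.exec.reducers.max=60;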

Upvotes: 2

chiku

Reputation: 258

You may want to look at this page, which talks about optimizing the number of task slots: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage

Here is my opinion on this:

1) Hive ideally tries to optimize the number of reducers based on the expected amount of data generated after the map tasks, and it expects the underlying cluster to be configured to support that number (see the sketch at the end of this answer).

2) Regarding whether or not it is a good idea to tweak this count:

  • First, let's try to analyze what could be the reason for the execution time coming down from 248 minutes to 155 minutes:

Case 1: Hive is using 400 reducers. Problem: only 70 reducers can run at any given point in time.

  • Assuming no JVM reuse, creating JVMs again and again would add a large overhead.

  • Not sure on this one: expecting 400 reducers could cause a problem like fragmentation. That is, if I know that only 70 reducers can run, my strategy for storing intermediate files would depend on that; but with 400 reducers the whole strategy goes for a toss.

Case 2: Hive is using 70 reducers. Both problems get addressed by setting this number.

I guess it's better to set it to the number of maximum available reducers, but I am no expert at this and will let the experts comment.
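As a rough illustration of the two knobs involved, here is a minimal sketch of the settings, assuming a Hive session on an MRv1 cluster; the 70-slot figure comes from the question, while the bytes-per-reducer value is just an older Hive default and not something stated in this thread:

    -- Hive's estimate is roughly:
    --   reducers = min(hive.exec.reducers.max,
    --                  ceil(estimated_input_bytes / hive.exec.reducers.bytes.per.reducer))
    SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- ~1 GB per reducer (an older default)

    -- Cap the estimate at (or just below) the cluster's total reducer slots,
    -- e.g. the ~70 slots mentioned in the question:
    SET hive.exec.reducers.max=70;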

Upvotes: 2
