Samriang

Reputation: 443

Tez. Slow reducers

I'm seeing strange behavior with a Tez MapReduce job.

I'm trying to read log data from Hive, split it into chunks by id, date and some other parameters, and then write the chunks to other Hive tables.

The map phase is fast enough and takes about 20 minutes. Then the reducers start, and 453 of the 458 reducers process their data within the next 20 minutes. But the last 5 reducers run for about 1 hour.

This happens because my input data includes some huge entries, and processing them takes a long time. What is the best practice for such cases? Should I tune Hadoop/Tez/Hive to allow some kind of parallel processing for the last reducers, or would it be smarter to split the input data by other parameters to avoid huge entries?

Thanks for any advice.

Upvotes: 1

Views: 3390

Answers (1)

Samson Scharfrichter

Reputation: 9067

The magic word behind that not-so-strange behavior is skew. And it's a veeeery common issue. Usually people prefer ignoring the problem... until they really feel the pain (just like you do now).

With Tez, since HIVE-7158 ("Use Tez auto-parallelism in Hive"), you can try to tinker with some specific properties:

hive.tez.auto.reducer.parallelism
hive.tez.max.partition.factor
hive.tez.min.partition.factor
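
As a minimal sketch, these properties can be set per session before running the query. The values below are illustrative starting points only (as far as I know they are close to the usual defaults for the two factors), so check them against your Hive version rather than treating them as recommendations.

```sql
-- Let Tez adjust the number of reducers at runtime
SET hive.tez.auto.reducer.parallelism=true;
-- Over-partition the shuffle: plan for up to 2x the estimated reducers (illustrative value)
SET hive.tez.max.partition.factor=2.0;
-- Lower bound: allow shrinking to 0.25x of the estimate (illustrative value)
SET hive.tez.min.partition.factor=0.25;
```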

But that "auto-parallelism" feature seems to apply when you have several abnormally small reduce datasets that can be merged, while your problem is the exact opposite (one abnormally large reduce dataset). So you should also try tinkering with

hive.exec.reducers.bytes.per.reducer
hive.exec.reducers.max 

...to change the scale and make "large" the new "normal" (hence "normal" becoming "small"). But then, maybe all you will get will be 3 reducers all taking 1 hour to complete. Hard to say.
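
A rough sketch of that rescaling, in the direction described above (more bytes per reducer, hence fewer and larger reducers). Both numbers are assumptions picked purely for illustration, not tuned values; you would have to experiment with your own data volumes.

```sql
-- Target input size per reducer, in bytes (here 1 GB; a larger value means fewer reducers)
SET hive.exec.reducers.bytes.per.reducer=1073741824;
-- Upper bound on the number of reducers Hive will plan (illustrative cap)
SET hive.exec.reducers.max=100;
```

Going the other way (a smaller bytes-per-reducer value) gives more reducers, but it will not help if one single key still carries most of the data.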

Good luck. This kind of performance tuning is more Art than Science.

~~~~~~

PS: of course, if you could remove the source of skewness by changing the way you organize your input dataset...

Upvotes: 5
