batilei
batilei

Reputation: 431

Why the time of Hadoop job decreases significantly when reducers reach certain number

I test the scalability of a MapReduce based algorithm with increasing number of reducers. It looks fine generally (the time decreases with increasing reducers). But the time of the job always decreases significantly when the reducer reach certain number (30 in my hadoop cluster) instead of decreasing gradually. What are the possible causes?

Something about My Hadoop Job: (1) Light Map Phase. Only a few hundred lines input. Each line will generate around five thousand key-value pairs. The whole map phase won't take more than 2 minutes. (2) Heavy Reduce Phase. Each key in the reduce function will match 1-2 thousand values. And the algorithm in reduce phase is very compute intensive. Generally the reduce phase will take around 30 minutes to be finished.

Time performance plot:

enter image description here

Upvotes: 0

Views: 181

Answers (2)

Saurabh Suman
Saurabh Suman

Reputation: 26

it should be because of high no of key-value pair. At specific no of reducers they are getting equally distributed to the reducers, which is resulting in all reducer performing the task at almost same time.Otherwise it might be the case that combiner keeps on waiting for 1 or 2 heavily loaded reducers to finish there job.

Upvotes: 0

shanmuga
shanmuga

Reputation: 4499

IMHO it could be that with sufficient number of reducers available the network IO (to transfer intermediate results) between each reduce stage decreases.
As network IO is usually the bottleneck in most Map-Reduce programs. This decrease in network IO needed will give significant improvement.

Upvotes: 0

Related Questions