Reputation: 431
I am testing the scalability of a MapReduce-based algorithm with an increasing number of reducers. It looks fine in general (the time decreases as reducers are added). But the job time always drops significantly when the number of reducers reaches a certain value (30 in my Hadoop cluster) instead of decreasing gradually. What are the possible causes?
Some details about my Hadoop job: (1) Light map phase. The input is only a few hundred lines, and each line generates around five thousand key-value pairs; the whole map phase takes no more than 2 minutes. (2) Heavy reduce phase. Each key in the reduce function matches 1-2 thousand values, and the algorithm in the reduce phase is very compute-intensive; the reduce phase generally takes around 30 minutes to finish.
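From these figures one can roughly estimate the reduce-side workload. A back-of-the-envelope sketch in Python; the 300-line input and the 1500 values per key are assumptions (the question only says "a few hundred" lines and "1-2 thousand" values):

```python
# Rough estimate of the reduce-side workload from the figures in the question.
# Assumed: 300 input lines ("a few hundred"), 5000 pairs per line,
# 1500 values per key (midpoint of the stated 1-2 thousand).
lines = 300
pairs_per_line = 5000
values_per_key = 1500

total_pairs = lines * pairs_per_line           # ~1.5 million intermediate pairs
distinct_keys = total_pairs // values_per_key  # ~1000 distinct reduce keys

# With hash partitioning, each of R reducers handles about
# distinct_keys / R keys; at R = 30 that is ~33 keys per reducer.
keys_per_reducer = distinct_keys / 30
print(total_pairs, distinct_keys, round(keys_per_reducer, 1))
```

Under these assumptions the job has on the order of a thousand distinct reduce keys, so the per-reducer key count is small enough that uneven partitioning can matter.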
Time performance plot:
Upvotes: 0
Views: 181
Reputation: 26
It should be because of the high number of key-value pairs. At a specific number of reducers the keys get distributed equally, so all reducers finish their work at almost the same time. Otherwise the job keeps waiting for one or two heavily loaded reducers to finish their work.
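This distribution argument can be illustrated with a small simulation of hash partitioning (a sketch only: md5 stands in for Hadoop's `key.hashCode() % numReduceTasks`, and the 1000-key count is an assumption). The job finishes when the most heavily loaded reducer finishes, so the gap between the maximum and the average load is what matters:

```python
import hashlib

def partition(key, num_reducers):
    # Stand-in for Hadoop's HashPartitioner (key.hashCode() % numReduceTasks);
    # md5 is used here only to get a stable hash across runs.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_reducers

def max_load(num_keys, num_reducers):
    # Count how many keys land on each reducer; the heaviest bucket
    # determines when the reduce phase ends.
    loads = [0] * num_reducers
    for i in range(num_keys):
        loads[partition("key-%d" % i, num_reducers)] += 1
    return max(loads)

# With 1000 keys, the average load at 30 reducers is ~33 keys, but the
# heaviest reducer typically gets noticeably more -- and the job waits for it.
for r in (10, 20, 30, 40):
    print(r, max_load(1000, r), round(1000 / r, 1))
```

At reducer counts where the heaviest bucket happens to sit close to the average, the stragglers disappear and the job time drops sharply rather than gradually.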
Upvotes: 0
Reputation: 4499
IMHO it could be that with a sufficient number of reducers available, the network I/O needed to transfer intermediate results to each reducer decreases. Since network I/O is usually the bottleneck in most MapReduce programs, this decrease in network I/O gives a significant improvement.
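A rough model of this claim, with assumed numbers (~1.5 million intermediate pairs of ~100 bytes each, and an assumed effective inbound bandwidth per reducer node): each reducer pulls about 1/R of the total shuffle data, and the pulls run in parallel, so per-reducer shuffle time shrinks as reducers are added:

```python
# Rough model of shuffle (network) cost per reducer. All numbers are
# assumptions: ~1.5 million intermediate pairs, ~100 bytes per serialized
# pair, and 10 MB/s of effective inbound bandwidth per reducer node.
PAIRS = 1_500_000
BYTES_PER_PAIR = 100           # assumed average serialized size
BANDWIDTH = 10 * 1024 * 1024   # assumed bytes/sec per reducer

def shuffle_seconds(num_reducers):
    # Each reducer pulls roughly 1/num_reducers of all intermediate data,
    # and the pulls happen in parallel, so per-reducer volume sets the time.
    per_reducer_bytes = PAIRS * BYTES_PER_PAIR / num_reducers
    return per_reducer_bytes / BANDWIDTH

for r in (5, 10, 30):
    print(r, round(shuffle_seconds(r), 2), "s")
```

Under this model the shuffle cost falls as 1/R, which by itself would give a gradual improvement; combined with the load-balance effect it could contribute to the jump the question describes.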
Upvotes: 0