Reputation: 137
I understand from "When do reduce tasks start in Hadoop" that a reduce task in Hadoop consists of three steps: shuffle, sort, and reduce, where the sort (and after it the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?
For example, let's say we have only one job with mappers mapperA and mapperB and 2 reducers. What I want to do is:
Is this possible? Thanks
Upvotes: 4
Views: 1095
Reputation: 4575
You can't with the current implementation. However, people have "hacked" the Hadoop code to do what you want to do.
In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not know yet which of the duplicate mappers will finish first.
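To illustrate why the barrier exists, here is a minimal self-contained sketch, not Hadoop code: the `map` and `reduce` methods are hypothetical stand-ins for a word-count mapper and for the shuffle, sort, and reduce steps. It shows that a result computed after only mapperA has finished is incomplete for any key that mapperB also emits.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ReduceBarrierDemo {
    // Hypothetical stand-in for a mapper: emits (word, 1) for each word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Hypothetical stand-in for shuffle + sort + reduce:
    // groups values by key (TreeMap keeps keys sorted) and sums them.
    static SortedMap<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        SortedMap<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> kv : output) {
                totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperA = map("apple banana apple");
        List<Map.Entry<String, Integer>> mapperB = map("banana cherry apple");

        // Reducing as soon as mapperA finishes gives counts that are not final.
        System.out.println("after mapperA only: "
                + reduce(Collections.singletonList(mapperA)));
        // Only with every mapper's output are the per-key groups complete.
        System.out.println("after both mappers: "
                + reduce(Arrays.asList(mapperA, mapperB)));
    }
}
```

The first line printed is `after mapperA only: {apple=2, banana=1}` and the second is `after both mappers: {apple=3, banana=2, cherry=1}`. The early result is wrong for `apple` and `banana` and misses `cherry` entirely, which is why the reducer cannot treat any key as finished until it knows no mapper is still running.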
However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the mappers' output. If you want to implement this sort of behavior (most likely for research purposes), then you should take a look at the `org.apache.hadoop.mapred.ReduceTask.ReduceCopier` class, which implements `ShuffleConsumerPlugin`.
EDIT: Finally, as @teo points out in this related SO question, the `ReduceCopier.fetchOutputs()` method is the one that holds the reduce task from running until all map outputs are copied (through the while loop in line 2026 of Hadoop release 1.0.4).
Upvotes: 3
Reputation: 22925
Starting the sort process before all mappers finish is sort of a Hadoop anti-pattern (if I may put it that way!), in that the reducers cannot know that there is no more data to receive until all mappers finish. You, the invoker, may know that, based on your definition of keys, partitioner, etc., but the reducers don't.
Upvotes: 1
Reputation: 30089
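You can configure this using the slowstart property, which sets the fraction of your mappers that must finish before reducers are scheduled and the copy to them starts. The stock Apache default is 0.05 (5%), although some clusters are configured with a much higher value; you can override it to 0 if you want. Note that this only lets the copy begin earlier: the sort and reduce still wait until all map outputs have been copied.
`mapreduce.job.reduce.slowstart.completedmaps` (named `mapred.reduce.slowstart.completed.maps` in Hadoop 1.x)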
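To make the fraction concrete, here is a minimal sketch of the arithmetic in plain Java, with no Hadoop dependency. The class and method names are hypothetical, and rounding the fraction up to a whole number of map tasks is an assumption about how the scheduler interprets it; a real cluster also takes available resources into account before starting reducers.

```java
class SlowstartThreshold {
    // Hypothetical helper: how many map tasks must complete before
    // reducers become eligible to start copying map output.
    // Assumption: the fraction is rounded up to a whole number of tasks.
    static int mapsBeforeReducersStart(double slowstartFraction, int totalMaps) {
        if (slowstartFraction < 0.0 || slowstartFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        return (int) Math.ceil(slowstartFraction * totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 10;
        // 0.0 means reducers may start copying right away;
        // 1.0 means they wait until every map task has finished.
        for (double fraction : new double[] {0.0, 0.25, 0.5, 1.0}) {
            System.out.println(fraction + " -> "
                    + mapsBeforeReducersStart(fraction, totalMaps)
                    + " of " + totalMaps + " maps");
        }
    }
}
```

In Hadoop 2 and later the property is named `mapreduce.job.reduce.slowstart.completedmaps`. Assuming the job driver goes through `ToolRunner`, it can be overridden per job on the command line with `-D mapreduce.job.reduce.slowstart.completedmaps=0.0`.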
Upvotes: 2