teo

Reputation: 137

How to start sort and reduce in Hadoop before shuffle completes for all mappers?

I understand from When do reduce tasks start in Hadoop that the reduce task in Hadoop consists of three steps: shuffle, sort, and reduce, where the sort (and, after that, the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?

For example, let's say we have only one job with two mappers, mapperA and mapperB, and 2 reducers. What I want to do is:

  1. mapperA finishes
  2. shuffle copies the appropriate partitions of mapperA's output to, let's say, reducers 1 and 2
  3. sort on reducers 1 and 2 starts sorting and reducing and generates some intermediate output
  4. now mapperB finishes
  5. shuffle copies the appropriate partitions of mapperB's output to reducers 1 and 2
  6. sort and reduce on reducers 1 and 2 start again, and the reducers merge the new output with the old one

Is this possible? Thanks

Upvotes: 4

Views: 1095

Answers (3)

cabad

Reputation: 4575

You can't with the current implementation. However, people have "hacked" the Hadoop code to do what you want to do.

In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not know yet which of the duplicate mappers will finish first.
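
To see why the grouping matters, consider the reduce() contract in the stock Hadoop 2.x mapreduce API: the framework promises that the Iterable handed to reduce() contains every value emitted for that key by any mapper. Here is a minimal sum reducer, shown purely to illustrate that contract:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Standard reducer shape: `values` must hold EVERY value emitted for
// `key` across ALL mappers, which is exactly why the sort/reduce steps
// cannot begin while any mapper (speculative or not) is still running.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```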

However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the output of the mappers. If you want to implement this sort of behavior (most likely for research purposes), you should take a look at the org.apache.hadoop.mapred.ReduceTask.ReduceCopier class, which implements ShuffleConsumerPlugin.

EDIT: Finally, as @teo points out in this related SO question, the ReduceCopier.fetchOutputs() method is the one that keeps the reduce task from running until all map outputs are copied (through the while loop in line 2026 of Hadoop release 1.0.4).
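
For orientation, here is a minimal sketch of what such a plugin looks like against the Hadoop 2.x interface, org.apache.hadoop.mapreduce.task.reduce.ShuffleConsumerPlugin (in the 1.0.4 code discussed above the equivalent logic lives inside ReduceTask itself). The class name EagerShufflePlugin and the comments about an "eager" merge are hypothetical illustrations, not working code:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.RawKeyValueIterator;
import org.apache.hadoop.mapreduce.task.reduce.ShuffleConsumerPlugin;

// Hypothetical plugin name; the init/run/close lifecycle is the actual
// ShuffleConsumerPlugin contract in Hadoop 2.x.
public class EagerShufflePlugin<K, V> implements ShuffleConsumerPlugin<K, V> {

  @Override
  public void init(ShuffleConsumerPlugin.Context<K, V> context) {
    // Called once before fetching starts; the context exposes the task's
    // configuration, reporter, local dirs, and merge machinery.
  }

  @Override
  public RawKeyValueIterator run() throws IOException, InterruptedException {
    // The stock plugin (Shuffle.run()) blocks in here until every map
    // output has been copied, then merges everything and hands a sorted
    // iterator to the reduce phase. An eager variant would instead merge
    // partial batches of map output as each mapper finishes -- the
    // stage-barrier-breaking behavior the question asks about.
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public void close() {
    // Tear down fetcher threads and merge buffers.
  }
}
```

In 2.x the stock implementation is org.apache.hadoop.mapreduce.task.reduce.Shuffle; as far as I recall, a replacement is selected through the mapreduce.job.reduce.shuffle.consumer.plugin.class job property.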

Upvotes: 3

davek

Reputation: 22925

Starting the sort process before all mappers finish is sort of a Hadoop anti-pattern (if I may put it that way!), in that the reducers cannot know that there is no more data to receive until all mappers finish. You, the invoker, may know that, based on your definition of keys, partitioner, etc., but the reducers don't.

Upvotes: 1

Chris White

Reputation: 30089

You can configure this using the slowstart property, which denotes the fraction of your mappers that need to have finished before the copy to the reducers starts. It is commonly set somewhere in the 0.9-0.95 (90-95%) range, but you can override it to 0 if you want:

`mapred.reduce.slowstart.completed.maps` (in Hadoop 2.x: `mapreduce.job.reduce.slowstart.completedmaps`)
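
A minimal sketch of setting it per job, assuming the Hadoop 2.x client API (the class and job names here are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 0.0f: start scheduling reducers (and thus the shuffle copy phase)
    // as soon as the first mapper finishes, instead of waiting for the
    // configured fraction of mappers to complete.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.0f);
    Job job = Job.getInstance(conf, "slowstart-demo");
    // ... set mapper, reducer, input/output paths as usual, then submit.
  }
}
```

Note that this only moves the copy phase earlier; as the other answers explain, the sort and reduce steps still wait until every map output has arrived.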

Upvotes: 2
