How to control the sort order of mapper results in MapReduce before they are sent to the reducer

Question

I'll use a slight variation of the word count example to explain what I am trying to do.

I have 3 mappers, each producing a complete word count result on 3 large input files. Suppose the output is:

Mapper 1 Result:
-------
cat 100
dog 50
fox 10

Mapper 2 Result:
-------
fox 200
pig 5
rat 1

Mapper 3 Result:
-------
dog 70
rat 50
fox 10

Notice that each result is a complete word count, with unique (key, count) pairs, for the given files.

On the reducer side, my algorithm requires that there be only one reducer. For reasons that are a bit too lengthy to discuss here, I want the results from each mapper to be fed into the reducer in descending order of count, but without performing any shuffle and sort step. That is, I would like the reducer to receive the results from each mapper in the following order, without any grouping by key:

cat 100
dog 50
fox 10

fox 200
pig 5
rat 1

dog 70
rat 50
fox 10

In other words, just load the results of each mapper into the reducer in descending order of value (not key).

John B · Accepted Answer

This seems like it should be a map-only job, since you don't want the shuffle and sort to happen.
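
A minimal sketch of the map-only idea, written as plain Python rather than Hadoop API code. With no reducers there is no shuffle and sort, so each mapper would have to order its own output before writing it. The function name `mapper_output` and the in-memory `Counter` are illustrative assumptions; in a real job you would set the number of reduce tasks to zero (for example with `job.setNumReduceTasks(0)` in the Java API) and emit the sorted pairs from the mapper once it has seen all of its input.

```python
from collections import Counter


def mapper_output(words):
    """Count the words of one input split and return (word, count)
    pairs in descending order of count (hypothetical helper)."""
    counts = Counter(words)
    # Sort by count descending; break ties by word so the order is deterministic.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy input split standing in for one large file.
split = ["cat"] * 3 + ["dog"] * 2 + ["fox"]
result = mapper_output(split)

for word, count in result:
    print(word, count)
```

Each mapper then writes its own output file already in the desired order, and no reduce phase is involved.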

If you REALLY need to use a reducer, then I suggest using a composite key and doing a secondary sort.

The key would include the mapper id, the normal key, and the count value. You would do the primary sort on mapper id and the secondary sort on count (descending). You would also need a grouping comparator that does not group anything (or groups on mapper id and normal key only).
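
The ordering this composite key produces can be simulated in plain Python. This is only a sketch of the sort and grouping logic, not Hadoop code: the names `CompositeKey`, `sort_comparator_key`, and `grouping_key` are hypothetical, and in a real job the same rules would live in a custom key class, a sort comparator, and a grouping comparator. The data is taken from the question.

```python
from itertools import groupby
from typing import NamedTuple


class CompositeKey(NamedTuple):
    mapper_id: int  # which mapper produced the record (assumed to be available)
    word: str       # the normal key
    count: int      # the value, promoted into the key so it can drive the sort


def sort_comparator_key(key):
    # Primary sort: mapper id ascending. Secondary sort: count descending.
    return (key.mapper_id, -key.count)


def grouping_key(key):
    # Group on mapper id and word only; each pair is unique, so nothing merges.
    return (key.mapper_id, key.word)


mapper_outputs = {
    1: [("cat", 100), ("dog", 50), ("fox", 10)],
    2: [("fox", 200), ("pig", 5), ("rat", 1)],
    3: [("dog", 70), ("rat", 50), ("fox", 10)],
}

records = [
    CompositeKey(mapper_id, word, count)
    for mapper_id, pairs in mapper_outputs.items()
    for word, count in pairs
]
records.reverse()  # arrival order at the reducer should not matter

ordered = sorted(records, key=sort_comparator_key)
groups = [list(group) for _, group in groupby(ordered, key=grouping_key)]

for key in ordered:
    print(key.mapper_id, key.word, key.count)
```

With this data the single reducer would see mapper 1's records first, then mapper 2's, then mapper 3's, each block in descending order of count, and every group would contain exactly one record.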

Again, given everything you would need to do to use a reducer while effectively bypassing the shuffle and sort, this seems like it should be a map-only job unless the output must be in a single file.
