Reputation: 5745
"MapReduce Design Patterns" book has pattern for finding distinct records in dataset. This is the algorithm:
map(key, record):
    emit record, null

reduce(key, records):
    emit key
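For concreteness, here is a minimal Hadoop (Java) sketch of that pattern as I understand it; the class names are my own, not from the book:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Distinct {

    // Emits the whole record as the key; NullWritable carries no payload.
    public static class DistinctMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(record, NullWritable.get());
        }
    }

    // All duplicates of a record arrive at the same reduce call, so
    // writing the key once per call yields the distinct set.
    public static class DistinctReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}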
On page 66 it says:
The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.
The map phase emits the record together with a NullWritable value (which is not written to the wire). What does the Combiner try to reduce? There is no value to reduce.
Upvotes: 1
Views: 529
Reputation: 20969
It reduces the duplicates in a single mapper's output.
Say your input is text data with one word per line:
John
Adam
John
John
There is no point in sending every John to the reducer if you can combine them after the map phase and send only:
John
Adam
This output is already distinct per mapper, which saves bandwidth if your split contains a fair number of non-distinct records.
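As a sketch of how that is wired up (this driver fragment is assumed, not from the book; it would sit inside the job's main method): registering the reducer class as the combiner is what collapses the duplicate keys within each mapper's output before the shuffle.

// Driver fragment: the combiner reuses DistinctReducer from the sketch above.
Job job = Job.getInstance(new Configuration(), "distinct");
job.setJarByClass(Distinct.class);
job.setMapperClass(Distinct.DistinctMapper.class);
job.setCombinerClass(Distinct.DistinctReducer.class); // local dedupe per mapper
job.setReducerClass(Distinct.DistinctReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);

This works because the reduce function here is both commutative and associative (it only drops duplicate keys), which is the condition for safely reusing a reducer as a combiner.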
Upvotes: 2