Majid Azimi

Reputation: 5745

Selecting distinct records in Hadoop and using a combiner

"MapReduce Design Patterns" book has pattern for finding distinct records in dataset. This is the algorithm:

map(key, record):
    emit record, null

reduce(key, records):
    emit key
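
For concreteness, here is a minimal sketch of that pseudocode in Hadoop's Java MapReduce API (the class names are my own for illustration, not from the book):

// DistinctMapper.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each record as the key with a NullWritable value, so the
// shuffle groups identical records under a single key.
public class DistinctMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        context.write(record, NullWritable.get());
    }
}

// DistinctReducer.java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits each distinct key exactly once, ignoring the empty values.
public class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text record, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(record, NullWritable.get());
    }
}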

On page 66 it says:

The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.

The map phase emits the record with a NullWritable value (which is not actually written to the wire). What is there for the Combiner to reduce? There are no values to reduce.

Upvotes: 1

Views: 529

Answers (1)

Thomas Jungblut

Reputation: 20969

It tries to reduce the duplicates in the map output.

Let's say you have text data with one word on each line:

John
Adam
John
John

There is no point in sending every "John" to the reducer if you can combine them after the map phase and send only:

John
Adam

That output is already distinct for each mapper, which saves bandwidth if you have a fair amount of non-distinct records in your split.
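
Because the reduce function just re-emits its key, the same class can double as the combiner. A sketch of the job setup, assuming mapper and reducer classes like the DistinctMapper/DistinctReducer in the question (the job name and argument handling are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distinct");
        job.setJarByClass(DistinctDriver.class);

        job.setMapperClass(DistinctMapper.class);
        // Running the reducer as a combiner deduplicates each mapper's
        // output locally before it is shuffled over the network.
        job.setCombinerClass(DistinctReducer.class);
        job.setReducerClass(DistinctReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the combiner in place, the map-side output for the example above shrinks from four records to two before it hits the wire.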

Upvotes: 2
