London guy
London guy

Reputation: 28022

Secondary sorting in Map-Reduce

I understood the way of sorting the values of a particular key before the key enters the reducer. I learned that it can be done by writing three methods viz, keycomparator, partitioner and valuegrouping.

Now, when valuegrouping runs, it basically groups all the values associated with the natural key, right? So when it groups all the values for the natural key, what will be the actual key that is sent along with a set of sorted values to the reducer? The natural key would have been associated with more than one type of entity (the second part of the composite key). What will be the composite key sent to the reducer?

ap

Upvotes: 2

Views: 859

Answers (1)

Chris White
Chris White

Reputation: 30089

This may be surprising to know, but each iteration of the values Iterable actually updates the key reference too:

protected void reduce(K key, Iterable<V> values, Context context) {
    for (V value : values) {
        // key object contents will update for each iteration of this loop
    }
}

I know this works for the new mapreduce API, i haven't traced it for the old mapred API.

So in answer to your question, all the keys will be available, the first key will relate to the first sorted key of the group.

EDIT: Some additional information as to how and why this works:

There are two comparators that the reducer uses to process the key/value pairs output by the map stage:

  • the key ordering comparator - This comparator is applied first and orders all the KV pairs. Conceptually you are still dealing with the serialized bytes at this stage.
  • the key group comparator - This comparator is responsible for determining when the previous and current key 'differ', denoting the boundary between one group of KV pairs and another

Under the hood, the reference to the key and value never changes, each call to Iterable.Iterator.next() advances the pointer in the underlying byte stream to the next KV pair. If the key grouper determines that the current set of keys bytes and previous set are comparatively the same key, then the hasNext method of the value Iterable.iterator() will return true, otherwise false. If true is returned, the bytes are deserialized into the Key and Value instances for consumption in your reduce method.

Upvotes: 4

Related Questions