HHH

Reputation: 6465

How does secondary sort work in Hadoop?

I understand that in a secondary sort we can use a user-defined class as the key. This class can have two attributes, for example: the pairs are grouped by the first attribute (the natural key) and then sorted by the second attribute (the secondary key). My question is this: the key objects within one group have different values for their second attribute, so the reducer cannot receive a single key. Instead, the reducer should receive a list of keys, since each key has a different value for its secondary key. Is that right?

Here is the key class:

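import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.WritableComparable;

public class KeyClass extends Configured implements WritableComparable<KeyClass> {

    public boolean secondary;  // secondary key: used for sorting within a group
    public String primary;     // natural key: used for grouping
    ...

}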

Upvotes: 1

Views: 314

Answers (2)

Niels Basjes

Reputation: 10642

Yes, you are correct. Conceptually there is one key per value, but you don't receive them as a list (not in the sense of a List).

The last time I played with secondary sort (a long time ago), I found that when I fetched the next value (i.e. called .next() on the iterator), the framework also changed the contents of the key instance.

This sounds really weird and that's why I remember it.

Please verify whether this is still true in the Hadoop version you are working with.
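
To make the behaviour above concrete, here is a minimal plain-Java sketch that simulates it. It does not use any Hadoop classes, and every name in it (CompositeKey, Pair, SORT, GROUP, simulate) is hypothetical. It assumes an int secondary key for readability, which differs from the boolean in the question. Records are sorted on both parts of the key and grouped on the primary part only. In a real job those two roles would typically be filled by the comparators registered through Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation only; no Hadoop classes are involved. All names are hypothetical.
class SecondarySortSketch {

    // Composite key: primary is the natural key, secondary only affects ordering.
    static final class CompositeKey {
        final String primary;
        final int secondary;
        CompositeKey(String primary, int secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }
    }

    // One map output record: a composite key plus its value.
    static final class Pair {
        final CompositeKey key;
        final String value;
        Pair(CompositeKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sorting compares both parts of the key.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.primary)
                      .thenComparingInt(k -> k.secondary);

    // Grouping compares the primary part only.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.primary);

    // Returns a log of what each simulated reduce call would observe.
    static List<String> simulate(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort((a, b) -> SORT.compare(a.key, b.key));

        List<String> log = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            CompositeKey first = sorted.get(i).key;
            log.add("reduce(" + first.primary + ")");
            // One reduce call covers every record that GROUP treats as equal.
            while (i < sorted.size() && GROUP.compare(first, sorted.get(i).key) == 0) {
                Pair current = sorted.get(i);
                // The key seen alongside each value is that value's own key.
                log.add("  key=" + current.key.primary + "/" + current.key.secondary
                        + " value=" + current.value);
                i++;
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair(new CompositeKey("b", 1), "v1"),
                new Pair(new CompositeKey("a", 2), "v2"),
                new Pair(new CompositeKey("a", 1), "v3"));
        for (String line : simulate(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

In this simulation there is one reduce call per primary key, and the secondary part of the key differs from value to value inside that call, which matches the "key changes as you iterate" observation.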

Upvotes: 0

Chris Gerken

Reputation: 16392

The reducer gets a single key and a list (an Iterable) of values. The key you get is associated with one of the values in the list. If you want to access the secondary key (the part of the composite key that changes across the list of values), then you should put that secondary key in the value, too.
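
A minimal sketch of that suggestion, again in plain Java rather than against the Hadoop API: the value type carries its own copy of the secondary key, so the reduce logic never depends on which key instance it was handed. TaggedValue, payload, and the reduce signature here are hypothetical names chosen for illustration, and the int secondary key is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; all names are hypothetical.
class ValueCarriesSecondary {

    // Value that duplicates the secondary key next to the actual payload.
    static final class TaggedValue {
        final int secondary;
        final String payload;
        TaggedValue(int secondary, String payload) {
            this.secondary = secondary;
            this.payload = payload;
        }
    }

    // Stand-in for a reduce call: one primary key, many values.
    static List<String> reduce(String primary, Iterable<TaggedValue> values) {
        List<String> out = new ArrayList<>();
        for (TaggedValue v : values) {
            // The secondary key is read from the value, not from the key.
            out.add(primary + ":" + v.secondary + ":" + v.payload);
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedValue> values = Arrays.asList(
                new TaggedValue(1, "x"),
                new TaggedValue(2, "y"));
        System.out.println(reduce("a", values));
    }
}
```

The cost of this design is that the secondary key is serialized twice per record (once in the key, once in the value); the benefit is that the reducer code stays correct regardless of how the framework handles the key object.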

Upvotes: 1
