Freddy
Freddy

Reputation: 174

Sort reducer input iterator value before processing in Hadoop

I have some input data coming to the reducer with the value type Iterator . How can I sort this list of values to be ascending order?

I need to sort them in order since they are time values, before processing all in the reducer.

Upvotes: 6

Views: 7218

Answers (2)

David Gruzman
David Gruzman

Reputation: 8088

What you asked for is called Secondary Sort. In a nutshell - you extend the key to add "value sort key" to it and make hadoop to group by only "real key" but sort by both.
Here is a very good explanation about the secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

Upvotes: 6

Eswara Reddy Adapa
Eswara Reddy Adapa

Reputation: 995

To achieve sorting of reducer input values using hadoop's built-in features,you can do this:

1.Modify map output key - Append map output key with the corresponding value.Emit this composite key and the value from map.Since hadoop uses entire key by default for sorting, map output records will be sorted by (your old key + value).

2.Although sorting is done in step 1, you have manipulated the map output key in the process.Hadoop does Partitioning and Grouping based on the key by default.

3.Since you have modified the original key, you need to take care of modifying Partitioner and GroupingComparator to work based on the old key i.e., only the first part of your composite key. Partitioner - decides which key-value pairs land in the same Reducer instance
GroupComparator - decides which key-value pairs among the ones that landed into the Reducer go to the same reduce method call.

4.Finally(and obviously) you need to extract the first part of input key in the reducer to get old key.

If you need more(and a better) answer, turn to Hadoop Definitive Guide 3rd Edition -> chapter 8 -> sorting -> secondary sort

Upvotes: 6

Related Questions