Aravind Yarram
Aravind Yarram

Reputation: 80176

Why to SORT the intermediate keys generated in map reduce?

I understand why the intermediate key values are grouped by key but why to sort them?

Upvotes: 5

Views: 420

Answers (2)

Praveen Sripati
Praveen Sripati

Reputation: 33495

According to Google's Paper on MapReduce

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.

And Hadoop has been implemented based on the Google's paper. Not all algorithms require data to be sorted. Sorting has been pluggable in Hadoop and alternates can be used. More info here.

Upvotes: 1

Donald Miner
Donald Miner

Reputation: 39893

That's how it is implementing the grouping. When you sort by the keys, they are grouped together. It really doesn't matter that it's sorted... it only matters that the keys that are equal are right next to each other.

It is possible that sorting isn't the best approach. Maybe some sort of hashing would be faster: O(N) instead of O(NlogN). It was implemented as sort just because there are some applications that want sorted keys (HBase/BigTable for example).

A pluggable sort has been worked on recently and is available in a beta. I haven't had the chance to try it out yet. http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html

Upvotes: 1

Related Questions