Reputation: 80176
I understand why the intermediate key values are grouped by key but why to sort them?
Upvotes: 5
Views: 420
Reputation: 33495
According to Google's Paper on MapReduce
We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.
And Hadoop has been implemented based on the Google's paper. Not all algorithms require data to be sorted. Sorting has been pluggable in Hadoop and alternates can be used. More info here.
Upvotes: 1
Reputation: 39893
That's how it is implementing the grouping. When you sort by the keys, they are grouped together. It really doesn't matter that it's sorted... it only matters that the keys that are equal are right next to each other.
It is possible that sorting isn't the best approach. Maybe some sort of hashing would be faster: O(N) instead of O(NlogN). It was implemented as sort just because there are some applications that want sorted keys (HBase/BigTable for example).
A pluggable sort has been worked on recently and is available in a beta. I haven't had the chance to try it out yet. http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
Upvotes: 1