Why sort the intermediate keys generated in MapReduce?

Question

I understand why the intermediate key-value pairs are grouped by key, but why sort them?

Donald Miner · Accepted Answer

That's how the grouping is implemented. When you sort by key, equal keys end up next to each other, so each group can be read off in a single sequential pass. The sorted order itself doesn't really matter for grouping... it only matters that equal keys are adjacent.
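To make the "sorting gives you grouping" point concrete, here is a minimal in-memory sketch. It is not Hadoop's actual shuffle code; the function name `group_by_sorting` and the sample pairs are made up for illustration, and it assumes the keys are comparable and everything fits in memory.

```python
from itertools import groupby
from operator import itemgetter


def group_by_sorting(pairs):
    """Group (key, value) pairs by sorting on the key, then scanning runs."""
    # Sorting is O(N log N); afterwards equal keys sit next to each other.
    ordered = sorted(pairs, key=itemgetter(0))
    # groupby only merges *adjacent* equal keys, which is exactly why
    # the sort has to come first.
    return [
        (key, [value for _, value in run])
        for key, run in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical map output: (key, value) pairs in arbitrary order.
pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
grouped = group_by_sorting(pairs)
print(grouped)
```

Note that `groupby` on the unsorted list would emit `"b"` and `"a"` twice each, because it never looks past the current run.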

It is possible that sorting isn't the best approach. Some sort of hashing might be faster: O(N) instead of O(N log N). It was implemented as a sort just because there are some applications that want sorted keys (HBase/BigTable, for example).
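For comparison, a hash-based grouping could look roughly like the sketch below. Again, this is only an illustration under assumptions (hashable keys, data that fits in memory, a made-up function name `group_by_hashing`), not something Hadoop ships. It runs in expected O(N) time, but the groups come out in first-seen order rather than sorted order, so anything downstream that wants sorted keys would lose out.

```python
from collections import defaultdict


def group_by_hashing(pairs):
    """Group (key, value) pairs with a hash table instead of a sort."""
    buckets = defaultdict(list)
    for key, value in pairs:
        # One expected O(1) hash lookup per pair, so O(N) overall.
        buckets[key].append(value)
    # Plain dict; keys keep insertion (first-seen) order, not sorted order.
    return dict(buckets)


# Hypothetical map output: (key, value) pairs in arbitrary order.
hash_pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
hash_grouped = group_by_hashing(hash_pairs)
print(hash_grouped)
```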

A pluggable sort has been worked on recently and is available in beta. I haven't had a chance to try it out yet. http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
