Partitioning output of Mappers in Hadoop

Question

This is a very basic question about Hadoop:

Suppose I have 3 mappers and 2 reducers. The mappers produced the following output:

Mapper 1 output : {1 -> "a1", 2 -> "b1"}, 
Mapper 2 output : {2 -> "b2", 3 -> "c2"}, 
Mapper 3 output : {1 -> "a3", 3 -> "c3"}

Now, as I understand it, the framework partitions the output into 2 parts (one part per reducer). Does the framework sort all of the output before partitioning? Is it possible for the reducers to get the following input?

Reducer 1 input : {1 -> "a1", 2 -> "b1", "b2"}
Reducer 2 input : {1 -> "a3", 3 -> "c2", "c3"}

Chris White · Accepted Answer

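Assuming that your notation above is Key -> Value, this shouldn't be possible, because key 1 is going to both reducer 1 and reducer 2 (maybe this is a typo?). The partitioner assigns each key to exactly one partition, so all values for a given key end up at the same reducer.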
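To make the "one key, one reducer" point concrete, here is a minimal sketch that routes the question's six pairs to 2 partitions. It assumes the default hash partitioning, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and uses plain `Integer` keys in place of Hadoop `Writable` types. The class `PartitionSketch` and its methods are hypothetical names for illustration, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Assumption: mirrors the arithmetic of the default hash partitioner.
    // The result depends only on the key, never on which mapper emitted it.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups every (key, value) pair under the partition chosen for its key.
    static List<Map<Integer, List<String>>> route(int[] keys, String[] values, int numReduceTasks) {
        List<Map<Integer, List<String>>> reducers = new ArrayList<>();
        for (int r = 0; r < numReduceTasks; r++) {
            reducers.add(new TreeMap<>());
        }
        for (int i = 0; i < keys.length; i++) {
            int p = partitionFor(keys[i], numReduceTasks);
            reducers.get(p).computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Output of mappers 1, 2 and 3 from the question, concatenated.
        int[] keys = {1, 2, 2, 3, 1, 3};
        String[] values = {"a1", "b1", "b2", "c2", "a3", "c3"};
        List<Map<Integer, List<String>>> reducers = route(keys, values, 2);
        for (int r = 0; r < reducers.size(); r++) {
            System.out.println("Partition " + r + " input: " + reducers.get(r));
        }
    }
}
```

With these assumptions, key 2 lands in partition 0 and keys 1 and 3 land in partition 1, so "a1" and "a3" always arrive at the same reducer.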

As for the ordering of operations:

K,V pairs are written to the output collector / map context (each K,V pair is serialized to a buffer in memory)
Once the size of the in-memory buffer reaches a threshold, the buffer data is spilled to disk and the buffer is cleared
For each spill:
- The buffer is sorted by key (again in memory)
- The buffer is iterated over for each partition, and the K,V pairs for that partition are written to a spill file (a single spill file contains all partitions, in order, and some index metadata is also written recording where each partition starts in the file).

So at the end of a map task, you'll have 1 or more sorted spills (sorted by partition, then key).
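The spill ordering described above can be sketched in memory as follows. This is only an illustration: it assumes hash partitioning of integer keys, and `SpillSketch`, `Rec`, `sortSpill` and `partitionStarts` are hypothetical names. The real map task works on serialized bytes and writes files, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpillSketch {
    // One buffered map output record (hypothetical type, not a Hadoop class).
    static final class Rec {
        final int partition;
        final int key;
        final String value;

        Rec(int partition, int key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Assumption: hash partitioning on the integer key.
    static int partitionFor(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    // Tags each record with its partition, then sorts by partition first and key second.
    static List<Rec> sortSpill(int[] keys, String[] values, int numReducers) {
        List<Rec> buffer = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            buffer.add(new Rec(partitionFor(keys[i], numReducers), keys[i], values[i]));
        }
        buffer.sort(Comparator.comparingInt((Rec r) -> r.partition).thenComparingInt(r -> r.key));
        return buffer;
    }

    // Index metadata: starts[p] is the position where partition p begins in the
    // sorted spill, and starts[numReducers] is the total number of records.
    static int[] partitionStarts(List<Rec> sorted, int numReducers) {
        int[] starts = new int[numReducers + 1];
        for (Rec r : sorted) {
            starts[r.partition + 1]++;
        }
        for (int p = 0; p < numReducers; p++) {
            starts[p + 1] += starts[p];
        }
        return starts;
    }

    public static void main(String[] args) {
        List<Rec> spill = sortSpill(new int[]{3, 1, 2, 1}, new String[]{"c", "a", "b", "a2"}, 2);
        for (Rec r : spill) {
            System.out.println("partition " + r.partition + ": " + r.key + " -> " + r.value);
        }
    }
}
```

A reducer fetching partition `p` would then only need the slice between `starts[p]` and `starts[p + 1]`, which is the role the index metadata plays in a spill file.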

If you have a combiner, then the combiner may run prior to writing the K,V pairs down for that partition (if the number of pairs in that partition exceeds some threshold).
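As a rough illustration of what a combiner does to one partition's sorted pairs before they are written, the sketch below folds adjacent pairs with equal keys into one. It assumes integer count values and a summing combine function, neither of which comes from the question, and `CombinerSketch` is a hypothetical name rather than a Hadoop class.

```java
import java.util.ArrayList;
import java.util.List;

class CombinerSketch {
    // Input: {key, count} pairs for one partition, already sorted by key.
    // Output: one {key, summedCount} pair per distinct key, in the same order.
    static List<int[]> combine(int[][] sortedPairs) {
        List<int[]> out = new ArrayList<>();
        for (int[] pair : sortedPairs) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last)[0] == pair[0]) {
                // Same key as the previous pair: fold the count into it.
                out.get(last)[1] += pair[1];
            } else {
                // New key: start a fresh output pair (copied, so the input is untouched).
                out.add(new int[]{pair[0], pair[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] partition = {{1, 1}, {1, 1}, {3, 1}};
        for (int[] pair : combine(partition)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Because the pairs are already sorted by key within the partition, a single pass over adjacent pairs is enough; fewer pairs then need to be written to the spill file and later shuffled to the reducer.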
