user3484461

Reputation: 1143

Sorting in the Hadoop framework

I have tried implementing a secondary sort, and I have a question related to that:

As I understand it, sorting happens three times in the Hadoop framework:

 1) Sorting in the buffer (sorting is based on the key emitted by the map function)
 2) Sorting while the mapper's spill files are merged (sorted on what basis?)
 3) Sorting on the reducer side, when the reducer fetches map output from the various mappers according to the partition logic and merges it again (sorting is based on the sort comparator)

If my understanding above is correct, what logic drives the sort when the spill files of the map output are merged? Is it based on the keys we emit in the map function, or on the sort comparator used for the reduce-side sort, and why?

Upvotes: 4

Views: 750

Answers (1)

Srinivasarao Daruna

Reputation: 3374

To answer precisely: in the buffer, the records are ordered by their keys, whereas at the reducer they are compared using the comparator.

This is how the sort on the map side happens. Each map task has a circular memory buffer that it writes its output to. When the contents of the buffer reach a certain threshold size, a background thread starts to spill the contents to disk.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
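The spill step described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: `SpillSketch`, `partitionFor`, and `partitionAndSort` are hypothetical names, records are modeled as `Map.Entry<Integer, String>`, the partition rule mimics hash partitioning (non-negative hash modulo the number of reducers), and the key ordering is passed in as a parameter rather than taken from a job configuration. The combiner step is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SpillSketch {

    // Assumption: mimic hash partitioning, i.e. non-negative hash modulo reducer count.
    static int partitionFor(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    // Group the buffered records by target partition, then sort each partition by key.
    // keyOrder is whatever key ordering applies; here it is simply a parameter.
    static Map<Integer, List<Map.Entry<Integer, String>>> partitionAndSort(
            List<Map.Entry<Integer, String>> buffer,
            int numReducers,
            Comparator<Integer> keyOrder) {
        Map<Integer, List<Map.Entry<Integer, String>>> partitions = new TreeMap<>();
        for (Map.Entry<Integer, String> record : buffer) {
            int p = partitionFor(record.getKey(), numReducers);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(record);
        }
        for (List<Map.Entry<Integer, String>> part : partitions.values()) {
            // In-memory sort by key within one partition.
            part.sort(Map.Entry.comparingByKey(keyOrder));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> buffer = List.of(
                Map.entry(5, "e"), Map.entry(2, "b"),
                Map.entry(4, "d"), Map.entry(1, "a"));
        // With 2 reducers, even keys land in partition 0 and odd keys in partition 1.
        System.out.println(partitionAndSort(buffer, 2, Comparator.naturalOrder()));
    }
}
```

Each list in the result corresponds to the sorted run for one reducer within a single spill; passing `Comparator.reverseOrder()` instead flips the order inside every partition.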

The final order at the reducer is determined by comparing each key with the others, which is exactly what a comparator does.

To examine this, I wrote a ReverseIntWritable, which orders keys in the reverse of IntWritable, and I wrote the output the same way from both the mapper and the reducer.

When I did not use a reducer, the input {(1, xyz), (2, ijk)} came out as {(1, xyz), (2, ijk)}. When I used a reducer, the output for the same input came out as {(2, ijk), (1, xyz)}.
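The effect of a reversed key ordering can be reproduced in plain Java. The sketch below is not the ReverseIntWritable from the experiment (that code is not shown here); `ReverseOrderSketch` and `ReverseInt` are hypothetical stand-ins that only assume the key's `compareTo` is inverted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReverseOrderSketch {

    // Hypothetical stand-in for a reversed integer key; not a Hadoop class.
    static final class ReverseInt implements Comparable<ReverseInt> {
        final int value;

        ReverseInt(int value) {
            this.value = value;
        }

        @Override
        public int compareTo(ReverseInt other) {
            // Arguments swapped relative to the natural order, so larger values sort first.
            return Integer.compare(other.value, this.value);
        }
    }

    // Sort the given keys with the reversed ordering and return them as plain ints.
    static List<Integer> sortedKeys(int... keys) {
        List<ReverseInt> wrapped = new ArrayList<>();
        for (int k : keys) {
            wrapped.add(new ReverseInt(k));
        }
        Collections.sort(wrapped);
        List<Integer> result = new ArrayList<>();
        for (ReverseInt r : wrapped) {
            result.add(r.value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Keys written in the order 1, 2 come back as 2, 1 once a sort is applied.
        System.out.println(sortedKeys(1, 2));
    }
}
```

Records that never pass through a sort keep the order in which they were written, which matches the no-reducer run above.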

Hope this helps.

Upvotes: 2
