Reputation: 3315
So this has always confused me. I'm not sure exactly how map-reduce works and I seem to get lost in the exact chain of events.
My understanding:
I think, to sum up, I just don't get how the files are being re-combined properly, and that's causing my map-reduce logic to fail.
Upvotes: 1
Views: 900
Reputation: 2682
Before reading this answer, please take some time to read about merge sort (the divide-and-conquer approach).
Here is the complete set of actions the framework performs behind the scenes:
1. The JobTracker figures out where the splits are located and spawns mappers close to the splits; the priority of locality is (1) data-local, (2) rack-local, (3) off-rack (a network hop away).
2. Mappers read the data (through RecordReaders provided by FileInputFormat) and produce key -> value pairs.
3. This data is saved to the local filesystem of the node where the mapper runs; the trick here is that the data saved on the local filesystem is "SORTED" and stored in partitions, one per reducer (a minimal partitioner sketch follows this list).
4. Each reducer pulls the data for its own partition from every mapper (don't forget: all the data pulled by a reducer is sorted), for example:
{
k1->v1
k1->v2
k2->v3
}
5. The reducer opens file pointers to all the sorted files pulled from the mappers and merges them (the grouping and sort comparators are applied while merging). Because the merge runs over already-sorted files, the reducer's output is sorted and is saved to HDFS.
This step is essentially the "merge" step of merge sort (a conceptual merge sketch follows below).
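To make step 3 concrete, here is a minimal sketch of how map output is assigned to partitions. The class name KeyHashPartitioner is mine; the body mirrors what Hadoop's default HashPartitioner does (hash of the key modulo the number of reduce tasks), which is why every record with the same key ends up at the same reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of key-based partitioning: every (k, v) pair the mapper emits is
// routed to one of numReduceTasks partitions purely by the hash of its key,
// so all values for a given key land in the same partition and therefore
// reach the same reducer.
public class KeyHashPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}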
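And here is a purely conceptual sketch of the merge in step 5 (plain Java, not the Hadoop API; the class and method names are mine): several already-sorted runs, one per mapper, are merged through a min-heap into a single sorted stream, the same way the merge step of merge sort combines sorted halves, just with k inputs instead of 2.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Conceptual reduce-side merge: each mapper hands the reducer an already
// sorted run of keys, and the reducer merges the runs into one sorted stream.
public class KWayMerge {

    public static List<String> merge(List<List<String>> sortedRuns) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));

        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) {
                heap.add(new int[] {run, 0});
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            int run = head[0], pos = head[1];
            merged.add(sortedRuns.get(run).get(pos));
            if (pos + 1 < sortedRuns.get(run).size()) {
                heap.add(new int[] {run, pos + 1});    // advance within the same run
            }
        }
        return merged;                                 // globally sorted, like the reducer's input
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
                List.of("apple", "dog"),               // sorted output pulled from mapper 1
                List.of("bee", "cat", "zebra"));       // sorted output pulled from mapper 2
        System.out.println(merge(runs));               // [apple, bee, cat, dog, zebra]
    }
}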
Upvotes: 1
Reputation: 5239
Step 3 is called the "shuffle". It's one of the main value-adds of map reduce frameworks, although it's also very expensive for large datasets. The framework does something akin to a GROUP BY operation on the complete set of records output by all the mappers, and then reducers are called with each group of records. To answer your individual questions for 3:
3.1. Imagine that your job is configured to have r total reducers. The framework carves up every one of the map output files into r pieces and sends each piece to one reducer task. With m total mappers, that is m × r little slices flying around. When a particular reducer has received all the slices it needs, it merges them all together and sorts the result by the K2 key, and then groups records on the fly by that key for individual calls to reduce() (see the reducer sketch below). If there are duplicate K2 keys, the group will be larger than a singleton. In fact, that is the point. If your mappers don't ever output identical keys, then your algorithm should not even need a reduce phase and you can skip the expensive shuffle altogether.
3.2. The load of doing all that data movement is spread across the whole cluster because each reducer task knows what outputs it wants and asks for them from each mapper. The only thing the master node has to do is coordinate, i.e., tell each reducer when to start pulling mapper outputs, watch for dead nodes, and keep track of everyone's progress.
3.3. Reducer output is not examined by the framework or combined in any way. However many reducer tasks you have (r), that's how many output files with K3, V3 records in them you will get. If you need that combined again, run another job on that output.
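To put 3.1 into code, here is a hedged word-count-style reducer sketch (the class name SumReducer is mine; the types are the standard Hadoop ones). Each call to reduce() receives exactly one K2 group assembled by the shuffle, and with r reduce tasks the job leaves r output files (part-r-00000, part-r-00001, and so on) on HDFS, which is why you run another job if you need them combined.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count-style reducer: the framework calls reduce() once per distinct
// K2 key, handing over every V2 the shuffle gathered for that key from all
// m mappers. One reduce task writes one output file (part-r-0000N) to HDFS.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {              // "values" is one group of the GROUP BY
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // one K3 -> V3 record per group
    }
}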
Upvotes: 2