Reputation: 2097
I am new to Hadoop and have trouble understanding how MapReduce works in a cluster environment.
Take the word count example code. Assume I have three nodes, each running a map task. After the map phase:
Machine A:
hello 1
word 1
data 1
...
Machine B:
hello 1
xu 2
...
The map outputs are saved in local files on each machine. My question is: how is this data, spread across multiple machines, merged before being passed to the reduce stage? For example, the reduce stage receives
hello <1, 1>
xu 1
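For reference, this is roughly the word count code I mean, a minimal sketch using the org.apache.hadoop.mapreduce API (the class names here are mine, just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each map task emits (word, 1) for every word in its input split.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // e.g. ("hello", 1)
        }
    }
}

// The framework delivers each word together with ALL of its counts,
// no matter which machine produced them, e.g. reduce("hello", [1, 1]).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // e.g. ("hello", 2)
    }
}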
Upvotes: 0
Views: 184
Reputation: 148
A mapper's output is locally sorted by key (by word in your case) and then partitioned into chunks; the number of chunks equals the number of reducers, or fewer if that particular mapper's output has no keys for some reducers. Each chunk is then sent to its corresponding reducer, which also receives chunks from the other mappers. There the chunks are merged with one another, and the merged result becomes the reducer's input.
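To see why records for the same word always end up at the same reducer, here is a small plain-Java sketch of the idea behind the default hash partitioning. Hadoop's default HashPartitioner applies this formula to the hash of the serialized Text key, so the actual partition numbers may differ from what String.hashCode() gives here, but the determinism is the point:

// Plain-Java illustration: with hash partitioning, a given key always
// lands on the same reducer regardless of which mapper/machine emitted it.
public class PartitionDemo {
    static int partitionFor(String key, int numReducers) {
        // Same formula as Hadoop's default HashPartitioner (non-negative hash mod #reducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 2;
        for (String key : new String[] {"hello", "word", "data", "xu"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
        // "hello" from Machine A and "hello" from Machine B compute the same
        // partition, so both records are fetched by the same reducer and merged there.
    }
}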
Upvotes: 0
Reputation: 1006
Once a map task finishes, its output is saved and handed to the Partitioner class, which is responsible for splitting the data according to the reducers. For example, in your case with 3 machines, suppose you are running 2 reducers. The getPartition() method of the Partitioner class then decides which of the 2 reducers each map output record goes to, e.g.:
hello 1  // reducer 1
word 1   // reducer 2
data 1   // reducer 1
So now 2 separate files will be created, one for each reducer. How many of these files are created on a given mapper node depends on whether that mapper's output actually contains data for every reducer, and remember that up to this point all of these files are still on the mapper node.
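If you want to control this routing yourself, a custom Partitioner can be plugged in. The following is only a hypothetical sketch of a getPartition() implementation and the driver wiring; the class name and the routing rule are made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner, just to illustrate getPartition(): words starting
// with a-m go to reducer 0, everything else to reducer 1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

// Driver wiring (assuming a configured Job called job):
// job.setNumReduceTasks(2);
// job.setPartitionerClass(FirstLetterPartitioner.class);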
After this, the WritableComparator class is called; it is responsible for sorting the data within each of the 2 files and also for grouping the records by key. Once this is done, the resulting files are ready to be sent across to the respective reducer nodes in the cluster.
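By default you do not have to set anything here: Text keys are sorted by their natural order. But if you wanted a non-default order, a custom WritableComparator would be registered on the job roughly like this (class name and ordering are made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: sort the words in descending order instead of
// the default ascending Text order.
public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true -> create key instances for deserialization
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);  // flip the natural ordering
    }
}

// In the driver:
// job.setSortComparatorClass(ReverseTextComparator.class);  // sort order within partitions
// (job.setGroupingComparatorClass(...) controls which keys share one reduce() call)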
After this, shuffling and sorting take place: every map node sends its output files to the respective reducer nodes, and on each reducer the files received from all the mappers are merged and sorted. For example, if there are 2 mappers and 2 reducers, and one mapper generates data for both reducer 1 and reducer 2 while the other generates only one output file, for reducer 1, then reducer 1 will receive two files and reducer 2 will receive one file.
After merging and sorting, the Reducer is run over these files and the final output is generated.
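To make the reduce-side merge concrete, here is a plain-Java simulation (not Hadoop code) of what conceptually happens when the sorted chunks from your two machines are merged and grouped; the real framework does a streaming k-way merge rather than building a map in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation of the reduce-side merge: sorted (word, count) chunks fetched
// from different mappers are merged into one sorted, grouped stream.
public class ShuffleMergeDemo {
    public static void main(String[] args) {
        // Chunks destined for the same reducer, one per mapper (already sorted by key).
        String[][] fromMapperA = {{"data", "1"}, {"hello", "1"}, {"word", "1"}};
        String[][] fromMapperB = {{"hello", "1"}, {"xu", "2"}};

        // A TreeMap keeps keys sorted; values from all chunks are grouped per key.
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (String[][] chunk : new String[][][] {fromMapperA, fromMapperB}) {
            for (String[] record : chunk) {
                merged.computeIfAbsent(record[0], k -> new ArrayList<>())
                      .add(Integer.parseInt(record[1]));
            }
        }

        // This is the shape the reducer sees: key -> list of values.
        // Prints: {data=[1], hello=[1, 1], word=[1], xu=[2]}
        System.out.println(merged);
    }
}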
Refer here for more details about the data flow from mapper to reducer.
Upvotes: 1
Reputation: 8937
Machine A: hello 1, word 1, data 1
Machine B: hello 1, xu 2
Reducer input: data {1}, hello {1,1}, word {1}, xu {2}
See more details about MapReduce in this article.
Upvotes: 1