Reputation: 2097
I am new to Hadoop and have trouble understanding how MapReduce works in a cluster environment.
Take the word count example code. Assume I have three nodes, each running a map task. After the map phase:
Machine A:
hello 1
word 1
data 1
...
Machine B:
hello 1
xu 2
...
The map outputs are saved in local files on each machine. My question is: how is this data, spread across multiple machines, merged before being passed to the reduce stage? For example, the reduce stage receives
hello <1, 1>
xu 1
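For reference, this is roughly the word count code I mean, a minimal sketch using the org.apache.hadoop.mapreduce API (the class names here are mine, just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each map task emits (word, 1) for every word in its input split.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // e.g. ("hello", 1)
        }
    }
}

// The framework delivers each word together with ALL of its counts,
// no matter which machine produced them, e.g. reduce("hello", [1, 1]).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // e.g. ("hello", 2)
    }
}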
Upvotes: 0
Views: 184
Reputation: 148
A mapper's output is locally sorted by key (by word in your case) and then partitioned into chunks; the number of chunks equals the number of reducers, or fewer if that particular mapper's output has no keys for some reducers. Each chunk is then sent to its corresponding reducer, which also receives chunks from the other mappers. There the chunks are merged with one another, and the merged result becomes the reducer's input.
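To see why records for the same word always end up at the same reducer, here is a small plain-Java sketch of the idea behind the default hash partitioning. Hadoop's default HashPartitioner applies this formula to the hash of the serialized Text key, so the actual partition numbers may differ from what String.hashCode() gives here, but the determinism is the point:

// Plain-Java illustration: with hash partitioning, a given key always
// lands on the same reducer regardless of which mapper/machine emitted it.
public class PartitionDemo {
    static int partitionFor(String key, int numReducers) {
        // Same formula as Hadoop's default HashPartitioner (non-negative hash mod #reducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 2;
        for (String key : new String[] {"hello", "word", "data", "xu"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
        // "hello" from Machine A and "hello" from Machine B compute the same
        // partition, so both records are fetched by the same reducer and merged there.
    }
}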
Upvotes: 0
Reputation: 1006
Once a map task finishes, its output is saved and handed to the Partitioner class, which is responsible for splitting the data according to the reducers. For example, in your case with 3 machines, suppose you are running 2 reducers. The getPartition() method of the Partitioner class then decides which of the 2 reducers each map output record goes to, e.g.:
hello 1  // reducer 1
word 1   // reducer 2
data 1   // reducer 1
So now 2 separate files will be created, one for each reducer. How many of these files are created on a given mapper node depends on whether that mapper's output actually contains data for every reducer, and remember that up to this point all of these files are still on the mapper node.
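If you want to control this routing yourself, a custom Partitioner can be plugged in. The following is only a hypothetical sketch of a getPartition() implementation and the driver wiring; the class name and the routing rule are made up for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner, just to illustrate getPartition(): words starting
// with a-m go to reducer 0, everything else to reducer 1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

// Driver wiring (assuming a configured Job called job):
// job.setNumReduceTasks(2);
// job.setPartitionerClass(FirstLetterPartitioner.class);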
After this, the WritableComparator class is called; it is responsible for sorting the data within each of the 2 files and also for grouping the records by key. Once this is done, the resulting files are ready to be sent across to the respective reducer nodes in the cluster.
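By default you do not have to set anything here: Text keys are sorted by their natural order. But if you wanted a non-default order, a custom WritableComparator would be registered on the job roughly like this (class name and ordering are made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: sort the words in descending order instead of
// the default ascending Text order.
public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true -> create key instances for deserialization
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);  // flip the natural ordering
    }
}

// In the driver:
// job.setSortComparatorClass(ReverseTextComparator.class);  // sort order within partitions
// (job.setGroupingComparatorClass(...) controls which keys share one reduce() call)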
After this, shuffling and sorting take place: every map node sends its output files to the respective reducer nodes, and on each reducer the files received from all the mappers are merged and sorted. For example, if there are 2 mappers and 2 reducers, and one mapper generates data for both reducer 1 and reducer 2 while the other generates only one output file, for reducer 1, then reducer 1 will receive two files and reducer 2 will receive one file.
After merging and sorting, the Reducer is run over these files and the final output is generated.
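To make the reduce-side merge concrete, here is a plain-Java simulation (not Hadoop code) of what conceptually happens when the sorted chunks from your two machines are merged and grouped; the real framework does a streaming k-way merge rather than building a map in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation of the reduce-side merge: sorted (word, count) chunks fetched
// from different mappers are merged into one sorted, grouped stream.
public class ShuffleMergeDemo {
    public static void main(String[] args) {
        // Chunks destined for the same reducer, one per mapper (already sorted by key).
        String[][] fromMapperA = {{"data", "1"}, {"hello", "1"}, {"word", "1"}};
        String[][] fromMapperB = {{"hello", "1"}, {"xu", "2"}};

        // A TreeMap keeps keys sorted; values from all chunks are grouped per key.
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (String[][] chunk : new String[][][] {fromMapperA, fromMapperB}) {
            for (String[] record : chunk) {
                merged.computeIfAbsent(record[0], k -> new ArrayList<>())
                      .add(Integer.parseInt(record[1]));
            }
        }

        // This is the shape the reducer sees: key -> list of values.
        // Prints: {data=[1], hello=[1, 1], word=[1], xu=[2]}
        System.out.println(merged);
    }
}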
Refer here for more details about the data flow from mapper to reducer.
Upvotes: 1
Reputation: 8937
Machine A: hello 1, word 1, data 1
Machine B: hello 1, xu 2
Reducer input: data {1}, hello {1,1}, word {1}, xu {2}
See more details about MapReduce in this article.
Upvotes: 1