Reputation: 29
I am learning Hadoop MapReduce using the word count example; please see the attached diagram:
My questions are about how the parallel processing actually happens. My understanding and questions are below; please correct me if I am wrong:
Is this understanding right? I have a feeling I am completely off in many respects.
Upvotes: 0
Views: 4554
Reputation: 1006
Let me explain each step in a little more detail so that the whole process is clearer to you. I have tried to keep the explanations as brief as possible, but I would recommend you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.
Split Step: if you have written a job by now, you must have observed that we sometimes set the number of reducers, but we never set the number of mappers. That is because the number of mappers depends on the number of input splits: in simple words, the number of mappers in any job equals the number of input splits. So the question arises: how does the splitting take place? That depends on a number of factors, such as mapred.max.split.size, which sets the maximum size of an input split; there are other ways too, but the point is that we can control the size of the input splits.
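To make the split sizing concrete, here is a plain-Python sketch of the clamping formula Hadoop's FileInputFormat uses to pick a split size (the function names and the 300 MB / 128 MB figures are illustrative, not from the question):

```python
def compute_split_size(block_size, min_size, max_size):
    # Mirrors the idea of FileInputFormat.computeSplitSize in Hadoop:
    # the split size is the block size, clamped between the configured
    # minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))

def number_of_splits(file_size, split_size):
    # One split (and hence one mapper) per split_size chunk,
    # rounding up for the last partial chunk.
    return -(-file_size // split_size)  # ceiling division

# Example: a 300 MB file with a 128 MB block size and default limits.
split = compute_split_size(block_size=128, min_size=1, max_size=2**63 - 1)
print(split)                          # 128
print(number_of_splits(300, split))   # 3 splits -> 3 mappers
```

Lowering the maximum split size below the block size is one way to get more splits, and therefore more mappers, for the same file.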
Mapping Step: if by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. Containers, or for that matter nodes, have nothing to do with splitting the input file: it is the job of HDFS to divide the file and distribute it across different nodes, and it is then the responsibility of the resource manager to initiate the map task on the same node that holds the input split, if possible. Once a map task is initiated, you can create key/value pairs according to your logic in the mapper. One thing to remember here is that one mapper can only work on one input split.
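As a sketch of the mapper logic for word count (plain Python standing in for the Java Mapper class; the function name and sample lines are my own):

```python
def word_count_map(split_lines):
    # Each mapper sees only the lines of its own input split and emits
    # a (word, 1) pair for every word, independently of other mappers.
    for line in split_lines:
        for word in line.split():
            yield (word, 1)

pairs = list(word_count_map(["Blue Black", "Blue"]))
print(pairs)  # [('Blue', 1), ('Black', 1), ('Blue', 1)]
```

Because each mapper works only on its own split, the map phase parallelizes trivially: no mapper ever needs another mapper's data.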
You have mixed up step 3, step 4 and step 5 a little. I have tried to explain those steps by describing them in reference to the actual classes which handle them.
Partitioner class: This class divides the output from the mapper tasks according to the number of reducers. It is useful if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides which reducer your mapper output will go to (if you have more than one reducer); this method is called for each key present in the mapper output. You can override this class, and subsequently this method, to customize it according to your requirements. So in the case of your example, since there is one reducer, it will merge the output from both mappers into a single file. If there were more reducers, the same number of intermediate files would have been created.
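The default HashPartitioner essentially does the following (sketched in Python; Python's hash stands in for Java's hashCode, so the structure matches Hadoop but the exact bucket a key lands in may not):

```python
def get_partition(key, num_reducers):
    # Same idea as HashPartitioner.getPartition: hash the key,
    # mask off the sign bit, take the remainder by reducer count.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# With a single reducer every key lands in partition 0, which is why
# all mapper output in the example is merged into one file.
assert all(get_partition(k, 1) == 0 for k in ["Black", "Blue", "Green"])
```

With more reducers, the modulo spreads keys across partitions, and every mapper applies the same function, so all occurrences of one key always reach the same reducer.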
WritableComparator class: Sorting of your map output is done by this class. The sorting is done on the basis of the key. Like the partitioner class, you can override it. In your example, if the key is the colour name, then it will sort them like this (assuming you do not override the class, in which case it uses the default sorting for Text, which is alphabetical order):
Black,1
Black,1
Black,1
Blue,1
Blue,1
.
.
and so on
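The sort above can be reproduced with a plain sort on the key (a sketch only; Hadoop sorts with WritableComparator over serialized keys, not Python tuples, but the default order for Text keys is the same alphabetical order shown here):

```python
mapper_output = [("Blue", 1), ("Black", 1), ("Blue", 1),
                 ("Black", 1), ("Black", 1), ("Blue", 1)]
# Default sort for Text keys is alphabetical order on the key.
sorted_output = sorted(mapper_output, key=lambda kv: kv[0])
print(sorted_output[:3])  # [('Black', 1), ('Black', 1), ('Black', 1)]
```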
Now this same class is also used for grouping your values according to your key, so that in the reducer you can iterate over them; in the case of your example ->
Black -> {1,1,1}
Blue -> {1,1,1,1,1,1}
Green -> {1,1,1,1,1}
.
.
and so on
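The grouping can be sketched the same way: after sorting, equal keys are adjacent, so each run of equal keys becomes one (key, list-of-values) pair handed to the reducer (itertools.groupby here is a stand-in for Hadoop's grouping comparator):

```python
from itertools import groupby

sorted_pairs = [("Black", 1), ("Black", 1), ("Black", 1),
                ("Blue", 1), ("Blue", 1)]
# groupby only merges adjacent equal keys, which is exactly why the
# sort step must run before grouping.
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted_pairs, key=lambda kv: kv[0])}
print(grouped)  # {'Black': [1, 1, 1], 'Blue': [1, 1]}
```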
There are some other details which affect the intermediate steps between the mapper and the reducer, and before the mapper as well, but those are not very relevant to what you want to know.
I hope this solves your query.
Upvotes: 2
Reputation: 148
Your diagram does not exactly show basic word counting in MapReduce. Specifically, the part after 'Merging-step 1' is misleading in terms of understanding how MapReduce parallelizes the reduce phase. A better diagram, imo, can be found at https://dzone.com/articles/word-count-hello-word-program-in-mapreduce
In the latter diagram it is easy to see that as soon as the mappers' output is sorted by output key and then shuffled across the nodes with reducers based on that key, the reducers can run in parallel.
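To see why the reducers can run in parallel, note that after the shuffle each reducer owns a disjoint set of keys, so each one sums its own groups with no coordination (a sketch; the partition contents below are made-up counts, not from the question's diagram):

```python
def reduce_partition(groups):
    # Each reducer sums the values of the keys routed to it,
    # never needing to see any other reducer's keys.
    return {key: sum(values) for key, values in groups.items()}

# Two reducers, each handed a disjoint slice of the grouped keys:
part0 = {"Black": [1, 1, 1], "Green": [1, 1]}
part1 = {"Blue": [1, 1, 1, 1]}
print(reduce_partition(part0))  # {'Black': 3, 'Green': 2}
print(reduce_partition(part1))  # {'Blue': 4}
```

Since the two calls share no keys, they could run on different nodes at the same time and their outputs simply be concatenated.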
Upvotes: 1