Reputation: 29
I am learning Hadoop MapReduce using the word count example; please see the attached diagram:
My questions are about how the parallel processing actually happens. My understanding and questions are below; please correct me if I am wrong:
Is this understanding right? I have a feeling I am completely off in many respects.
Upvotes: 0
Views: 4554
Reputation: 1006
Let me explain each step in a little more detail so that the whole process is clearer to you. I have tried to keep the explanations as brief as possible, but I would recommend you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.
Split Step: if you have written a job by now, you must have observed that we sometimes set the number of reducers, but we never set the number of mappers. That is because the number of mappers depends on the number of input splits: in simple words, the number of mappers in any job equals the number of input splits. So the question arises: how does the splitting take place? That depends on a number of factors, such as mapred.max.split.size, which sets the maximum size of an input split; there are other ways too, but the point is that we can control the size of the input splits.
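To make the split sizing concrete, here is a plain-Python sketch of the clamping formula Hadoop's FileInputFormat uses to pick a split size (the function names and the 300 MB / 128 MB figures are illustrative, not from the question):

```python
def compute_split_size(block_size, min_size, max_size):
    # Mirrors the idea of FileInputFormat.computeSplitSize in Hadoop:
    # the split size is the block size, clamped between the configured
    # minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))

def number_of_splits(file_size, split_size):
    # One split (and hence one mapper) per split_size chunk,
    # rounding up for the last partial chunk.
    return -(-file_size // split_size)  # ceiling division

# Example: a 300 MB file with a 128 MB block size and default limits.
split = compute_split_size(block_size=128, min_size=1, max_size=2**63 - 1)
print(split)                          # 128
print(number_of_splits(300, split))   # 3 splits -> 3 mappers
```

Lowering the maximum split size below the block size is one way to get more splits, and therefore more mappers, for the same file.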
Mapping Step: if by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. Containers, or for that matter nodes, have nothing to do with splitting the input file: it is the job of HDFS to divide the file and distribute it across different nodes, and it is then the responsibility of the resource manager to initiate the map task on the same node that holds the input split, if possible. Once a map task is initiated, you can create key/value pairs according to your logic in the mapper. One thing to remember here is that one mapper can only work on one input split.
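As a sketch of the mapper logic for word count (plain Python standing in for the Java Mapper class; the function name and sample lines are my own):

```python
def word_count_map(split_lines):
    # Each mapper sees only the lines of its own input split and emits
    # a (word, 1) pair for every word, independently of other mappers.
    for line in split_lines:
        for word in line.split():
            yield (word, 1)

pairs = list(word_count_map(["Blue Black", "Blue"]))
print(pairs)  # [('Blue', 1), ('Black', 1), ('Blue', 1)]
```

Because each mapper works only on its own split, the map phase parallelizes trivially: no mapper ever needs another mapper's data.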
You have mixed up step 3, step 4 and step 5 a little. I have tried to explain those steps by describing them in reference to the actual classes which handle them.
Partitioner class: This class divides the output from the mapper tasks according to the number of reducers. It is useful if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides which reducer your mapper output will go to (if you have more than one reducer); this method is called for each key present in the mapper output. You can override this class, and subsequently this method, to customize it according to your requirements. So in the case of your example, since there is one reducer, it will merge the output from both mappers into a single file. If there were more reducers, the same number of intermediate files would have been created.
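The default HashPartitioner essentially does the following (sketched in Python; Python's hash stands in for Java's hashCode, so the structure matches Hadoop but the exact bucket a key lands in may not):

```python
def get_partition(key, num_reducers):
    # Same idea as HashPartitioner.getPartition: hash the key,
    # mask off the sign bit, take the remainder by reducer count.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# With a single reducer every key lands in partition 0, which is why
# all mapper output in the example is merged into one file.
assert all(get_partition(k, 1) == 0 for k in ["Black", "Blue", "Green"])
```

With more reducers, the modulo spreads keys across partitions, and every mapper applies the same function, so all occurrences of one key always reach the same reducer.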
WritableComparator class: Sorting of your map output is done by this class. The sorting is done on the basis of the key. Like the partitioner class, you can override it. In your example, if the key is the colour name, then it will sort them like this (assuming you do not override the class, in which case it uses the default sorting for Text, which is alphabetical order):
Black,1
Black,1
Black,1
Blue,1
Blue,1
.
.
and so on
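The sort above can be reproduced with a plain sort on the key (a sketch only; Hadoop sorts with WritableComparator over serialized keys, not Python tuples, but the default order for Text keys is the same alphabetical order shown here):

```python
mapper_output = [("Blue", 1), ("Black", 1), ("Blue", 1),
                 ("Black", 1), ("Black", 1), ("Blue", 1)]
# Default sort for Text keys is alphabetical order on the key.
sorted_output = sorted(mapper_output, key=lambda kv: kv[0])
print(sorted_output[:3])  # [('Black', 1), ('Black', 1), ('Black', 1)]
```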
Now this same class is also used for grouping your values according to your key, so that in the reducer you can iterate over them; in the case of your example ->
Black -> {1,1,1}
Blue -> {1,1,1,1,1,1}
Green -> {1,1,1,1,1}
.
.
and so on
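The grouping can be sketched the same way: after sorting, equal keys are adjacent, so each run of equal keys becomes one (key, list-of-values) pair handed to the reducer (itertools.groupby here is a stand-in for Hadoop's grouping comparator):

```python
from itertools import groupby

sorted_pairs = [("Black", 1), ("Black", 1), ("Black", 1),
                ("Blue", 1), ("Blue", 1)]
# groupby only merges adjacent equal keys, which is exactly why the
# sort step must run before grouping.
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted_pairs, key=lambda kv: kv[0])}
print(grouped)  # {'Black': [1, 1, 1], 'Blue': [1, 1]}
```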
There are some other details which affect the intermediate steps between the mapper and the reducer, and before the mapper as well, but those are not very relevant to what you want to know.
I hope this solves your query.
Upvotes: 2
Reputation: 148
Your diagram does not exactly show basic word counting in MapReduce. Specifically, the part after 'Merging-step 1' is misleading in terms of understanding how MapReduce parallelizes the reduce phase. A better diagram, imo, can be found at https://dzone.com/articles/word-count-hello-word-program-in-mapreduce
In the latter diagram it is easy to see that as soon as the mappers' output is sorted by output key and then shuffled across the nodes with reducers based on that key, the reducers can run in parallel.
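To see why the reducers can run in parallel, note that after the shuffle each reducer owns a disjoint set of keys, so each one sums its own groups with no coordination (a sketch; the partition contents below are made-up counts, not from the question's diagram):

```python
def reduce_partition(groups):
    # Each reducer sums the values of the keys routed to it,
    # never needing to see any other reducer's keys.
    return {key: sum(values) for key, values in groups.items()}

# Two reducers, each handed a disjoint slice of the grouped keys:
part0 = {"Black": [1, 1, 1], "Green": [1, 1]}
part1 = {"Blue": [1, 1, 1, 1]}
print(reduce_partition(part0))  # {'Black': 3, 'Green': 2}
print(reduce_partition(part1))  # {'Blue': 4}
```

Since the two calls share no keys, they could run on different nodes at the same time and their outputs simply be concatenated.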
Upvotes: 1