Reputation: 586
Started using Hadoop recently and struggling to make sense of a few things. Here is a basic WordCount example that I'm looking at (count the number of times each word appears):
Map(String docid, String text):
    for each word term in text:
        Emit(term, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Firstly, what is Emit(term, 1) supposed to be doing? I notice that in all of the examples I look at, the second parameter is always set to 1, but I can't seem to find an explanation of it.
Also, just to clarify - am I correct in saying that in Reduce, term is the key and sum is the value, and together they form the key-value pair? If this is the case, is values simply a list of 1's for each term that was emitted from Map? That's the only way I can make sense of it, but these are just assumptions.
Apologies for the noob question. I have looked at tutorials, but I often find that a lot of confusing terminology is used and basic things are made more complicated than they actually are, so I'm struggling a little to make sense of this.
Appreciate any help!
Upvotes: 2
Views: 766
Reputation: 191884
Take this input as an example word count input.
The mapper will split this sentence into words and emit a (word, 1) pair for each one:
Take,1
this,1
input,1
as,1
an,1
example,1
word,1
count,1
input,1
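In Hadoop's Java API, that map step looks roughly like the sketch below. It assumes the standard org.apache.hadoop.mapreduce classes; WordCountMapper is just an illustrative name, not code from any particular tutorial.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one -
        // this is the Emit(term, 1) from the pseudocode
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}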
Then the reducer receives "groups" of the same word (or key) along with a list of the grouped values, like so (the framework also sorts the keys, but that's not important for this example):
Take, (1)
this, (1)
input (1, 1)
etc...
As you can see, the key input has been "reduced" to a single entry with the values (1, 1), which you can loop over to sum the values and emit like so:
Take,1
this,1
input,2
etc...
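The reduce step, again as a minimal sketch against the standard Hadoop API (WordCountReducer is just an illustrative name):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // counts is the grouped list of 1s for this word, e.g. (1, 1) for "input"
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // Emit(term, sum)
    }
}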
Upvotes: 3
Reputation: 4845
Good question.
As explained, the mapper outputs a sequence of (key, value) pairs, in this case of the form (word, 1) for each word. The reducer receives them grouped as (key, <1,1,...,1>), sums up the terms in the list, and returns (key, sum). Note that it is not the reducer that does the grouping; it's the map-reduce environment.
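To make that grouping step concrete, it can be simulated in a few lines of plain Java (no Hadoop here, just an illustrative standalone program):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleDemo {
    public static void main(String[] args) {
        // What the mapper emits: one (word, 1) pair per word
        String[] words = "Take this input as an example word count input".split(" ");

        // The shuffle/group step the framework performs between map and reduce:
        // collect every emitted 1 under its key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // The reduce step: sum the grouped values for each key
        grouped.forEach((word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            System.out.println(word + "," + sum);   // e.g. input,2
        });
    }
}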
The map-reduce programming model is different from the one we're used to working in, and it's often not obvious how to implement an algorithm in this model. (Think, for example, about how you would implement k-means clustering.)
I recommend Chapter 2 of the freely available Mining of Massive Datasets book by Leskovec et al. See also the corresponding slides.
Upvotes: 2