Reputation: 586
Started using Hadoop recently and struggling to make sense of a few things. Here is a basic WordCount example that I'm looking at (count the number of times each word appears):
Map(String docid, String text):
    for each word term in text:
        Emit(term, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Firstly, what is Emit(term, 1) supposed to be doing? I notice that in all of the examples I look at, the second parameter is always set to 1, but I can't seem to find an explanation of it.
Also, just to clarify - am I correct in saying that in Reduce, term is the key and sum is the value, and together they form the key-value pair? If this is the case, is values simply a list of 1's for each term that was emitted from Map? That's the only way I can make sense of it, but these are just assumptions.
Apologies for the noob question. I have looked at tutorials, but I often find that a lot of confusing terminology is used and basic things are made more complicated than they actually are, so I'm struggling a little to make sense of this.
Appreciate any help!
Upvotes: 2
Views: 766
Reputation: 191884
Take this input as an example word count input.
The mapper will split this sentence into words and emit a (word, 1) pair for each one:
Take,1
this,1
input,1
as,1
an,1
example,1
word,1
count,1
input,1
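In Hadoop's Java API, that map step looks roughly like the sketch below. It assumes the standard org.apache.hadoop.mapreduce classes; WordCountMapper is just an illustrative name, not code from any particular tutorial.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one -
        // this is the Emit(term, 1) from the pseudocode
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}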
Then the reducer receives "groups" of the same word (or key) along with a list of the grouped values, like so (the framework also sorts the keys, but that's not important for this example):
Take, (1)
this, (1)
input (1, 1)
etc...
As you can see, the key input has been "reduced" to a single entry with the values (1, 1), which you can loop over to sum the values and emit like so:
Take,1
this,1
input,2
etc...
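The reduce step, again as a minimal sketch against the standard Hadoop API (WordCountReducer is just an illustrative name):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // counts is the grouped list of 1s for this word, e.g. (1, 1) for "input"
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // Emit(term, sum)
    }
}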
Upvotes: 3
Reputation: 4845
Good question.
As explained, the mapper outputs a sequence of (key, value) pairs, in this case of the form (word, 1) for each word. The reducer receives them grouped as (key, <1,1,...,1>), sums up the terms in the list, and returns (key, sum). Note that it is not the reducer that does the grouping; it's the map-reduce environment.
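To make that grouping step concrete, it can be simulated in a few lines of plain Java (no Hadoop here, just an illustrative standalone program):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleDemo {
    public static void main(String[] args) {
        // What the mapper emits: one (word, 1) pair per word
        String[] words = "Take this input as an example word count input".split(" ");

        // The shuffle/group step the framework performs between map and reduce:
        // collect every emitted 1 under its key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // The reduce step: sum the grouped values for each key
        grouped.forEach((word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            System.out.println(word + "," + sum);   // e.g. input,2
        });
    }
}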
The map-reduce programming model is different from the one we're used to working in, and it's often not obvious how to implement an algorithm in this model. (Think, for example, about how you would implement k-means clustering.)
I recommend Chapter 2 of the freely available Mining of Massive Datasets book by Leskovec et al. See also the corresponding slides.
Upvotes: 2