Reputation: 39
I have been asked to modify the WordCount example so that each mapper sums the occurrences of each word in its file before passing them on. So for instance, instead of:
<help,1>
<you,1>
<help,1>
<me,1>
The output of the mapper would be:
<help,2>
<you,1>
<me,1>
So would I add each word to an array and then check for occurrences, or is there a simpler way? Here is the relevant part of my map() method:
// 'word' is a reusable Text field and 'one' is a constant IntWritable(1),
// as in the stock WordCount example
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one); // emits <word, 1> for every single occurrence
}
Upvotes: 2
Views: 777
Reputation: 7462
You can define a Java Map structure, or a Guava Multiset, and count the occurrences of each word in each Mapper. Then, when the mapper ends, the cleanup method, which runs afterwards, can emit all the partial sums as the map output, like this (a sketch, written against the usual Mapper&lt;Object, Text, Text, IntWritable&gt; signature):
private Map<String, Integer> counts;

@Override
protected void setup(Context context) {
    counts = new HashMap<>();
}

@Override
protected void map(Object key, Text value, Context context) {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
        // 1 on first occurrence, otherwise increment the partial sum
        counts.merge(tokenizer.nextToken(), 1, Integer::sum);
    }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit one <word, partialSum> pair per distinct word seen by this mapper
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
}
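The Guava variant has the same shape; here is a minimal sketch assuming Guava is on the classpath, where the Multiset handles the absent-key case for you:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

private final Multiset<String> counts = HashMultiset.create();

// in map(): increments the word's count, starting from 0 if unseen
counts.add(tokenizer.nextToken());

// in cleanup(): one entry per distinct word, with its total count
for (Multiset.Entry<String> e : counts.entrySet()) {
    context.write(new Text(e.getElement()), new IntWritable(e.getCount()));
}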
Quoting the Mapper's documentation (version 2.6.2):
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via the JobContext.getConfiguration().
The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
Other than that, you can also consider using a Combiner as an alternative: it performs similar local aggregation on each mapper's output before it is sent across the network to the reducers.
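For the stock WordCount job that is a one-line change in the driver (a sketch, assuming the usual IntSumReducer, whose summing is associative and commutative and therefore safe to reuse as a combiner):

// Reuse the reducer as a combiner so partial sums are computed
// on the map side before the shuffle
job.setCombinerClass(IntSumReducer.class);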
Upvotes: 1