Reputation: 359
I'm trying to tweak the WordCount example found here: http://wiki.apache.org/hadoop/WordCount so that it sums and returns the total number of words in the input file instead of counting the occurrences of each word.
I tried changing the mapper class so that, instead of writing the current word, it writes "Sum: " for every word.
i.e., in class "Map", replace
word.set(tokenizer.nextToken());
with
word.set("Sum: ");
All the rest of the file remains the same.
That way, I thought, all mapper output would go to the same reducer, which would sum up the number of "Sum: " keys, which in turn would be the total number of words in the file.
meaning instead of:
word 1
other 1
other 1
that yields:
word 1
other 2
I was expecting to have:
Sum: 1
Sum: 1
Sum: 1
that yields:
Sum: 3
Instead, when I try to run the code, I get a really long mapping operation that eventually ends with an exception:
RuntimeException: java.io.IOException: Spill failed
no matter how small the input file is.
Looking forward to your help. Thank you.
Upvotes: 2
Views: 410
Reputation: 3109
You have an endless loop. In your code, you still need to call
tokenizer.nextToken()
to actually advance the StringTokenizer by one word within the line. Otherwise your map operation never makes progress: it keeps emitting records for the same position forever, so the map output keeps growing until spilling it to disk fails, which is why you see the "Spill failed" exception no matter how small the input is.
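To see why the loop never terminates, here is a minimal plain-Java sketch (no Hadoop needed) showing that hasMoreTokens() only peeks at the tokenizer and never consumes anything; only nextToken() advances it:

```java
import java.util.StringTokenizer;

public class TokenizerLoopDemo {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("two words");

        // hasMoreTokens() does not consume a token, so calling it
        // repeatedly always yields the same answer; a while-loop
        // guarded only by it can never exit.
        System.out.println(tokenizer.hasMoreTokens()); // true
        System.out.println(tokenizer.hasMoreTokens()); // true

        // Only nextToken() moves the tokenizer forward.
        tokenizer.nextToken();
        tokenizer.nextToken();
        System.out.println(tokenizer.hasMoreTokens()); // false
    }
}
```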
So you would need something like this:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text sumText = new Text("Sum: ");

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokenizer.nextToken(); // advance to the next word
            context.write(sumText, one);
        }
    }
}
However, there is a better solution without a loop. You can use the countTokens()
method of StringTokenizer:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        context.write(new Text("Sum: "), new IntWritable(tokenizer.countTokens()));
    }
}
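As a quick sanity check outside Hadoop, countTokens() reports the number of remaining whitespace-separated tokens in a line without consuming any of them:

```java
import java.util.StringTokenizer;

public class CountTokensDemo {
    public static void main(String[] args) {
        // countTokens() returns how many tokens are left,
        // without advancing the tokenizer.
        StringTokenizer tokenizer = new StringTokenizer("the quick brown fox");
        System.out.println(tokenizer.countTokens()); // 4
    }
}
```

With this mapper, each input line produces a single ("Sum: ", wordsInLine) pair, and the summing reducer from the original WordCount example (unchanged) adds those per-line counts together under the one "Sum: " key to give the file's total word count.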
Upvotes: 4