theexplorer

Reputation: 359

Hadoop Java Word Count tweak not working - try to sum all

I'm trying to tweak the wordcount example found here: http://wiki.apache.org/hadoop/WordCount so it will sum and return the number of words in the input file instead of counting the occurrences of each word.

I tried changing the mapper class so that, instead of writing the word from the current iteration, it writes "Sum: " for every word.

i.e. replace

 word.set(tokenizer.nextToken());

in the "Map" class with

 word.set("Sum: ");

All the rest of the file remains the same.
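
So the map method now looks roughly like this (everything else is unchanged):

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set("Sum: ");        // was: word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}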

That way, I thought, all of the mappers' output would go to the same reducer, which would sum up the "Sum: " records and thereby give the total number of words in the file.

Meaning, instead of:

 word  1
 other 1
 other 1

that yields:

word  1
other 2

I was expecting to have:

 Sum:  1
 Sum:  1
 Sum:  1

that yields:

 Sum: 3

Instead, when I try to run the code, the map phase runs for a very long time and eventually ends with an exception:

RuntimeException: java.io.IOException: Spill failed

no matter how small the input file is.

Looking forward to your help. Thank you.

Upvotes: 2

Views: 410

Answers (1)

jmiserez

Reputation: 3109

You have an endless loop. In your code, you need to call

tokenizer.nextToken()

to actually advance the StringTokenizer to the next word in the line. Otherwise your mapping operation never makes progress: the loop keeps emitting the same ("Sum: ", 1) record over and over, which is why the map phase runs for so long and eventually fails with the spill exception.

So you would need something like this:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text sumText = new Text("Sum: ");

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokenizer.nextToken();        // advance to the next word; the word itself is not needed
            context.write(sumText, one);  // emit ("Sum: ", 1) once per word
        }
    }
}

However, there is a better solution without a loop. You can use the countTokens() method of StringTokenizer:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        context.write(new Text("Sum: "), new IntWritable(tokenizer.countTokens()));
    }
}
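
In both cases, the Reduce class from the original WordCount example can stay as it is: every record now has the single key "Sum: ", so the reducer simply adds up the partial counts. For reference, a sketch of that standard summing reducer in the same new-API style as above:

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up the per-word (or per-line) partial counts
        }
        context.write(key, new IntWritable(sum)); // e.g. ("Sum: ", 3)
    }
}

For larger inputs it is also worth registering this reducer as a combiner (job.setCombinerClass(Reduce.class) in the driver), so partial sums are computed on the map side before everything is shuffled to a single reducer.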

Upvotes: 4
