Seba Kerckhof

Reputation: 1391

Duplicates in hadoop mapreduce

I'm starting out with Hadoop 0.20.2. I wanted to begin with the basic wordcount problem, using the code I found here: http://cxwangyi.blogspot.com/2009/12/wordcount-tutorial-for-hadoop-0201.html

This works like it should. However, the words are spread over multiple files and I want to count words per file, so I change the mapper to:

    String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit()).getPath().getName();

    word.set(itr.nextToken() + "@" + fileName);

But then I get duplicates in my mapreduced file, like this:

    word1@file1   1
    word2@file2   1
    word2@file2~  1
    ...

So the line word2@file2~ 1 should not be there...

Does anybody know what I'm doing wrong?

Thanks

Upvotes: 0

Views: 444

Answers (1)

Brainlag

Reputation: 628

Are you sure you don't have a file with a tilde at the end added to the input of the Hadoop job? Some editors, such as Gedit, create these backup copies every time the file is edited.
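If that is the cause, one option is to skip such files when reading the input. In Hadoop you would implement `org.apache.hadoop.fs.PathFilter` and register it with `FileInputFormat.setInputPathFilter`; the core check is just a filename predicate. Here is a minimal plain-Java sketch of that predicate (the class name `SkipBackups` is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SkipBackups {
    // The same test a Hadoop PathFilter's accept() would perform:
    // reject Gedit-style backup copies, which end in '~'.
    static boolean accept(String fileName) {
        return !fileName.endsWith("~");
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("file1", "file2", "file2~");
        List<String> kept = inputs.stream()
                                  .filter(SkipBackups::accept)
                                  .collect(Collectors.toList());
        System.out.println(kept); // prints [file1, file2]
    }
}
```

Wrapping this check in a `PathFilter` keeps the backup file out of the job entirely, so the duplicate word2@file2~ keys never reach the mapper.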

Upvotes: 2

Related Questions