Seba Kerckhof

Reputation: 1391

Duplicates in hadoop mapreduce

I'm starting out with Hadoop 0.20.2. I wanted to begin with the basic wordcount problem, using the code I found here: http://cxwangyi.blogspot.com/2009/12/wordcount-tutorial-for-hadoop-0201.html

This works like it should. However, the words are spread over multiple files and I want to count words per file, so I change the mapper to:

    String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit()).getPath().getName();

    word.set(itr.nextToken() + "@" + fileName);

But then I get duplicates in my mapreduced file, like this:

    word1@file1   1
    word2@file2   1
    word2@file2~  1
    ...

So the line word2@file2~ 1 should not be there...

Does anybody know what I'm doing wrong?

Thanks

Upvotes: 0

Views: 444

Answers (1)

Brainlag

Reputation: 628

Are you sure you don't have a file with a tilde at the end added to the input of the Hadoop job? Some editors, such as Gedit, create these backup copies every time the file is edited.
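If that is the cause, one option is to skip such files when reading the input. In Hadoop you would implement `org.apache.hadoop.fs.PathFilter` and register it with `FileInputFormat.setInputPathFilter`; the core check is just a filename predicate. Here is a minimal plain-Java sketch of that predicate (the class name `SkipBackups` is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SkipBackups {
    // The same test a Hadoop PathFilter's accept() would perform:
    // reject Gedit-style backup copies, which end in '~'.
    static boolean accept(String fileName) {
        return !fileName.endsWith("~");
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("file1", "file2", "file2~");
        List<String> kept = inputs.stream()
                                  .filter(SkipBackups::accept)
                                  .collect(Collectors.toList());
        System.out.println(kept); // prints [file1, file2]
    }
}
```

Wrapping this check in a `PathFilter` keeps the backup file out of the job entirely, so the duplicate word2@file2~ keys never reach the mapper.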

Upvotes: 2

Related Questions