Reputation: 3811
I'm a beginner in Hadoop. I've understood the WordCount program, but now I have a problem: I don't want counts for all the words, only for the words listed in a separate file.
- Words_I_Want.txt -
hello
echo
raj
- Text.txt -
hello eveyone. I want hello and echo count
The output should be:
hello 2
echo 1
raj 0
That was just an example; my actual data is very large.
Upvotes: 2
Views: 1963
Reputation: 1952
matt b's answer is definitely good for large-to-small joins, but let's assume you're doing a large-to-large join.
You can map Words_I_Want.txt: k: the word, v: some marker
You can then map Text.txt: k: the word, v: 1 (same as the standard word count)
To tell the two files apart, you can use MultipleInputs to assign a separate mapper to each file, or use a single mapper and figure out which file is which using conf.get("map.input.file").
Then, in the reduce step, collect output only when the key has a marker.
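The reduce step described above can be sketched in plain Java, without the Hadoop classes, to show the logic only. The class name MarkerJoinSketch, the MARKER value of -1, and the reduce helper are all hypothetical names chosen for illustration; in a real job the same loop would sit inside the Reducer's reduce method and iterate over the values Hadoop hands it.

```java
import java.util.Arrays;
import java.util.List;

public class MarkerJoinSketch {
    // Hypothetical marker emitted once for every word in Words_I_Want.txt.
    // Real counts from Text.txt are always 1, so -1 cannot collide with them.
    static final int MARKER = -1;

    // Mimics reduce(key, values): returns the count for a wanted word,
    // or null when no marker was seen (the word is not in the wanted list).
    static Integer reduce(String key, List<Integer> values) {
        boolean wanted = false;
        int sum = 0;
        for (int v : values) {
            if (v == MARKER) {
                wanted = true;   // key came from Words_I_Want.txt
            } else {
                sum += v;        // key came from Text.txt
            }
        }
        return wanted ? Integer.valueOf(sum) : null;
    }

    public static void main(String[] args) {
        // Values as they might arrive after the shuffle for three keys.
        String[] keys = {"hello", "raj", "want"};
        List<List<Integer>> grouped = Arrays.asList(
                Arrays.asList(MARKER, 1, 1),  // wanted, seen twice
                Arrays.asList(MARKER),        // wanted, never seen
                Arrays.asList(1));            // seen, but not wanted
        for (int i = 0; i < keys.length; i++) {
            Integer count = reduce(keys[i], grouped.get(i));
            if (count != null) {
                System.out.println(keys[i] + " " + count);
            }
        }
    }
}
```

Because the marker reaches the reducer even when the word never appears in Text.txt, this approach also yields the "raj 0" line from the question. If a combiner is added, it would need to pass the marker through rather than sum it.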
Upvotes: 0
Reputation: 139931
In the WordCount example, the Mapper outputs each tokenized word from the input value and the number 1:
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
If you only want to count certain words, wouldn't you want your Mapper to output only the words that match your list?
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        // Emit only the words that appear in the list of wanted words
        if (wordsThatYouCareAbout.contains(token)) {
            word.set(token);
            output.collect(word, one);
        }
    }
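For reference, here is a minimal plain-Java sketch of the same filter, runnable without Hadoop. It assumes wordsThatYouCareAbout is a HashSet built from the lines of Words_I_Want.txt; the class name FilteredCountSketch and the count helper are hypothetical. In an actual job the set would typically be loaded once per mapper (for example in configure(), from a side file shipped with DistributedCache) rather than per record. Note that a mapper which emits only matches never produces a line such as "raj 0", so the sketch pre-fills zero counts for every wanted word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

public class FilteredCountSketch {
    // Counts only the wanted words in text, keeping zero counts for
    // wanted words that never occur. Tokens are split on whitespace,
    // as in the standard WordCount example.
    static Map<String, Integer> count(List<String> wantedWords, String text) {
        Set<String> wordsThatYouCareAbout = new HashSet<>(wantedWords);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : wantedWords) {
            counts.put(w, 0);  // guarantees a "word 0" entry
        }
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (wordsThatYouCareAbout.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
                Arrays.asList("hello", "echo", "raj"),
                "hello eveyone. I want hello and echo count");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints:
        // hello 2
        // echo 1
        // raj 0
    }
}
```

The HashSet keeps each lookup cheap, which matters when Text.txt is large; this only works while the wanted list fits in each mapper's memory.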
Upvotes: 2