raj

Reputation: 3811

Custom word count using Hadoop

I'm a beginner in Hadoop. I've understood the WordCount program, but now I have a problem: I don't want the output for all the words, only for the words listed in a separate file.

- Words_I_Want.txt -
hello
echo
raj

- Text.txt -
hello eveyone. I want hello and echo count


The output should be:
hello 2
echo 1
raj 0


That was just an example; my actual data is very large.

Upvotes: 2

Views: 1963

Answers (2)

Jieren

Reputation: 1952

matt b's answer is definitely good for large-to-small joins, but let's assume you're doing a large-to-large join.

You can map Words_I_Want.txt: k: the word, v: some marker

You can then map Text.txt: k: the word, v: 1 (same as the standard word count)

You'll have to use MultipleInputs and figure out which file is which using conf.get("map.input.file").

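Then, in the reduce step, collect output only for keys that carry a marker, summing the 1s from Text.txt as the count. A wanted word that never appears in the text still reaches the reducer through its marker, so it is emitted with a count of 0.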
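To make the data flow concrete, here is a minimal sketch of that marker join in plain Java. It uses no Hadoop classes: a sorted map stands in for the shuffle, and the class name, method names, and the MARKER value of -1 are all my own assumptions for illustration. In a real job the two map methods would be separate Mapper classes and the last method would be the Reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class MarkerJoinWordCount {

    // Hypothetical marker value; anything that cannot be confused with a count of 1 works.
    static final int MARKER = -1;

    // Stand-in for the mapper over Words_I_Want.txt: emits (word, MARKER).
    static void mapWanted(String line, Map<String, List<Integer>> shuffle) {
        String word = line.trim();
        if (!word.isEmpty()) {
            shuffle.computeIfAbsent(word, k -> new ArrayList<>()).add(MARKER);
        }
    }

    // Stand-in for the mapper over Text.txt: emits (word, 1), as in the standard word count.
    static void mapText(String line, Map<String, List<Integer>> shuffle) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            shuffle.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
        }
    }

    // Stand-in for the reducer: sums the 1s, but keeps a key only if a marker was seen.
    static Map<String, Integer> reduce(Map<String, List<Integer>> shuffle) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle.entrySet()) {
            boolean marked = false;
            int sum = 0;
            for (int value : entry.getValue()) {
                if (value == MARKER) {
                    marked = true;
                } else {
                    sum += value;
                }
            }
            if (marked) {
                output.put(entry.getKey(), sum);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (String wanted : new String[] {"hello", "echo", "raj"}) {
            mapWanted(wanted, shuffle);
        }
        mapText("hello eveyone. I want hello and echo count", shuffle);
        // Prints, in key order: echo 1, hello 2, raj 0
        reduce(shuffle).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

Note that "raj" comes out with 0 because its marker reaches the reducer even though no (raj, 1) pair exists, which is what the question asks for.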

Upvotes: 0

matt b

Reputation: 139931

In the WordCount example, the Mapper outputs each tokenized word from the input value and the number 1:

while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
}

If you only want to count certain words, then wouldn't you want to only output words from your Mapper that are matches against your list?

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    if (wordsThatYouCareAbout.contains(token)) {
        word.set(token);
        output.collect(word, one);
    }
}
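The snippet above leaves open where wordsThatYouCareAbout comes from and how a word with no matches gets reported. Here is a rough standalone sketch in plain Java, with no Hadoop classes, that shows the same filter applied to the sample data from the question. The class and method names are made up for illustration, and seeding every wanted word with 0 is my assumption about how to get the "raj 0" line. In an actual job you would typically load the set once per mapper (for example from a file shipped via the distributed cache) rather than pass it in like this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

class FilteredWordCount {

    // Counts only the tokens contained in wanted; every wanted word starts at 0
    // so that words which never occur still show up in the result.
    static Map<String, Integer> count(Set<String> wanted, List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : wanted) {
            counts.put(word, 0);
        }
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                // Same check as in the mapper above: skip anything not on the list.
                if (wanted.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> wanted = new LinkedHashSet<>(Arrays.asList("hello", "echo", "raj"));
        List<String> text = Arrays.asList("hello eveyone. I want hello and echo count");
        // Prints: hello 2, echo 1, raj 0 (one per line, in the order of the wanted list)
        count(wanted, text).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

A HashSet-style lookup keeps the per-token check cheap, which matters when the text is large. Keep in mind that StringTokenizer splits on whitespace only, so a token like "hello." with trailing punctuation would not match "hello" unless you normalize it first.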

Upvotes: 2
