Reputation: 399
I have a homework assignment in which I must retrieve the total number of distinct words in a certain document.
It's very similar to the WordCount example provided by Hadoop. But now I just want the total number of distinct words in the document. In the console output the number of reduce input groups corresponds to the total number of distinct words.
Is there a simple way to retrieve this number without even reducing the data? Or is Map/Reduce not the way to go for this problem? Chaining jobs could also be a solution, but since the answer already appears in the console output of the job, I'm wondering whether there is a simple way to retrieve the number of reduce input groups without doing work that isn't needed.
Greetings, Hadoop newcomer
Upvotes: 2
Views: 1023
Reputation: 39893
At some point you have to group the data, because there is no way to check for distinctness without bringing the data together.
Well, you are right about how to cheat. And by cheat, I mean it is how I would do this in a production environment just because of how simple it is, even though it feels dirty anyway.
In your console output, look for "Reduce input groups=". This tells you how many groups your reducers received. One group maps to one key, which means each unique key is represented once.
Reduce input groups=146030
You could make your own counter to count the groups, but the number will be the same.
... Then use grep or something like that to yank it out.
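If you do want your own counter, a minimal sketch would be to bump it once per reduce() call, since reduce() runs exactly once per key. The enum name WordStats.DISTINCT_WORDS here is just something I made up:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctCountingReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Hypothetical counter enum; pick any group/name you like.
    public enum WordStats { DISTINCT_WORDS }

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // reduce() is called once per key, so this counter ends up equal to
        // "Reduce input groups" from the console output.
        context.getCounter(WordStats.DISTINCT_WORDS).increment(1);

        // Normal word-count summing, unchanged.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```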
You can also query the job status through the API in the driver if you want to grab the counter value.
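Assuming you are on the new org.apache.hadoop.mapreduce API and job is the Job object in your driver, something like this should work once the job finishes; the built-in counter you want is TaskCounter.REDUCE_INPUT_GROUPS:

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// ... in the driver, after submitting the word-count job:
if (job.waitForCompletion(true)) {
    Counter distinct = job.getCounters()
            .findCounter(TaskCounter.REDUCE_INPUT_GROUPS);
    System.out.println("Distinct words: " + distinct.getValue());
}
```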
Your other option, which is obviously slower because it is an additional job: first phase, do the word count; second phase, do a line count over the word-count output.
The general way to do a line count is to emit the same dummy string as the key, along with a 1, for each row. Basically, your map function is solely context.write(dummyText, one). Be sure to use a combiner and set the number of reducers to 1.
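Roughly, that second job could look like the sketch below (new mapreduce API assumed; class names and the input/output paths are placeholders). It points the mapper at the word-count output, reuses the reducer as a combiner, and forces a single reducer so you get one total:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {

    // Every input line (one distinct word from the word-count output)
    // is emitted under the same dummy key with a count of 1.
    public static class LineCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final Text DUMMY = new Text("lines");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(DUMMY, ONE);
        }
    }

    // Sums the 1s; used as both combiner and the single reducer.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line count");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineCountMapper.class);
        job.setCombinerClass(SumReducer.class);   // collapses each mapper's output locally
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);                 // single reducer produces one total
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // word-count output dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner matters here because everything goes to one key: without it, every mapper would ship one record per input line to the lone reducer, whereas with it each mapper sends a single pre-summed record.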
Upvotes: 1