Reputation: 69
How can I eliminate duplicate values in a single file using a Hadoop MapReduce program?
The output should contain only the unique values.
For example, given a file:
line 1: Hi this is Ashok
line 2: Basics of hadoop framework
line 3: Hi this is Ashok
From this example, the output should contain only unique values, i.e. it should print lines 1 and 2, since line 3 is a duplicate of line 1. How can I do this?
Upvotes: 1
Views: 1152
Reputation: 39903
This is word count without the count.
The typical way to do this is to group by the entire line, then only output the key in the reducer. Here is some pseudocode:
map(key, value):
    emit (value, null)

reducer(key, iterator):
    emit (key, null)
Notice that I'm just outputting the input value as the key from the mapper. The mapper's output value can be null (i.e., NullWritable), or you can just use an integer or whatever.
In the reducer, I don't care how many I saw, I just output the key.
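To make the pseudocode above concrete, here is a minimal sketch in plain Java that simulates the map, shuffle, and reduce steps locally. It does not use the Hadoop API: the class name DedupSketch and its methods are hypothetical, and a sorted TreeMap stands in for the framework's group-by-key shuffle. In a real job the mapper would emit the line as the key with a NullWritable value, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of the dedup job; not Hadoop code.
public class DedupSketch {

    // Map + shuffle: emit (line, null) for every input line, then group by key.
    // The TreeMap is an assumption standing in for Hadoop's sorted shuffle.
    public static Map<String, List<Object>> mapAndShuffle(List<String> lines) {
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (String line : lines) {
            // map(key, value): emit (value, null)
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add(null);
        }
        return grouped;
    }

    // Reduce: ignore how many values arrived for a key and emit the key once.
    public static List<String> reduce(Map<String, List<Object>> grouped) {
        List<String> output = new ArrayList<>();
        for (String key : grouped.keySet()) {
            // reducer(key, iterator): emit (key, null)
            output.add(key);
        }
        return output;
    }

    // Convenience wrapper that runs both phases.
    public static List<String> dedupe(List<String> lines) {
        return reduce(mapAndShuffle(lines));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Hi this is Ashok",
                "Basics of hadoop framework",
                "Hi this is Ashok");
        // Prints each distinct line once, in sorted key order.
        dedupe(input).forEach(System.out::println);
    }
}
```

Because every copy of a line lands in the same key group, the reducer sees each distinct line exactly once no matter how often it appeared in the input. The output comes out in key order rather than input order, which is also what you would expect from the shuffle in a real job.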
Upvotes: 8