Ashok

Reputation: 69

How to eliminate duplicate values in a single file using a Hadoop MapReduce program


I have a single file that contains duplicate lines. In the output I need only the unique values.

For example, in a file:

Line 1: Hi this is Ashok

Line 2: Basics of hadoop framework

Line 3: Hi this is Ashok

From this example I need only the unique values in the output, i.e. it should print "Hi this is Ashok" (lines 1 and 3) only once, along with line 2. How can I do this?

Upvotes: 1

Views: 1152

Answers (1)

Donald Miner

Reputation: 39903

This is word count without the count.

The typical way to do this is to group by the entire line, then only output the key in the reducer. Here is some pseudocode:

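map(key, value):
   # value is the full line of text; emit it as the output key
   emit (value, null)

reduce(key, iterator):
   # all identical lines are grouped under one key; emit that key once
   emit (key, null)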

Notice that the mapper just outputs its input value (the whole line) as the key. The output value can be null (e.g., NullWritable), or you can just use an integer or whatever.

In the reducer, I don't care how many values I saw for a key; I just output the key once.
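
To see why grouping by the whole line removes duplicates, here is a minimal plain-Java sketch that simulates the map, shuffle, and reduce steps locally. It does not use the Hadoop API; the class name DistinctLines, its methods, and the TreeMap standing in for the shuffle are all made up for illustration. In a real job the same logic would presumably live in a Mapper and a Reducer, with Text keys and NullWritable values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of the map -> shuffle -> reduce flow (no Hadoop dependency).
class DistinctLines {

    // "Map" step: emit the whole line as the key, with a placeholder value.
    static void map(String line, Map<String, List<Boolean>> shuffle) {
        shuffle.computeIfAbsent(line, k -> new ArrayList<>()).add(Boolean.TRUE);
    }

    // "Reduce" step: ignore how many values arrived and emit the key once.
    static void reduce(String key, List<Boolean> values, List<String> output) {
        output.add(key);
    }

    // Runs both steps; the TreeMap groups equal lines and keeps keys sorted.
    static List<String> distinct(List<String> lines) {
        Map<String, List<Boolean>> shuffle = new TreeMap<>();
        for (String line : lines) {
            map(line, shuffle);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Boolean>> entry : shuffle.entrySet()) {
            reduce(entry.getKey(), entry.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Hi this is Ashok",
                "Basics of hadoop framework",
                "Hi this is Ashok");
        for (String line : distinct(input)) {
            System.out.println(line);
        }
    }
}
```

Running main on the three lines from the question prints two lines, "Basics of hadoop framework" and "Hi this is Ashok", in sorted key order rather than input order.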

Upvotes: 8
