Chitra

Reputation: 198

Deduplication in Hadoop

I have a large amount of ingested device data that contains duplicates. I also have a separate history of ingested file metadata (used to uniquely identify an ingested file). I am looking to deduplicate my ingested device data against this history. The history file is not small, so I am not looking to load it into memory. I have considered a reduce-side join as well, but that would push huge amounts of data through the network.

A Bloom filter is something I am looking at to reduce the size of my history file, but it gives me the opposite guarantee, i.e., it may report a duplicate when there isn't one (a false positive).
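
For reference, here is a minimal sketch of what I mean, assuming Guava's BloomFilter and made-up file ids; the last line is exactly the false-positive case that worries me:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class HistoryFilterSketch {
    public static void main(String[] args) {
        // Build the filter from the history of ingested file identifiers.
        BloomFilter<CharSequence> history = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10_000_000, // expected number of history entries (assumed)
                0.01);      // acceptable false-positive probability
        history.put("fileId-123"); // hypothetical metadata key

        // mightContain() never misses an id that was added (no false negatives),
        // but it may return true for an id that was never added (a false positive).
        System.out.println(history.mightContain("fileId-123")); // always true
        System.out.println(history.mightContain("fileId-456")); // usually false, occasionally true
    }
}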

Deduplication seems to be a fairly common problem and I am looking to see if anyone else has possible ideas.

Upvotes: 3

Views: 5630

Answers (2)

jmiserez

Reputation: 3109

If you are going to use Map/Reduce for deduplication and you want to use multiple machines for the task, you have to send all your data over the network. That is what Hadoop does!

Of course you can also run everything on one machine; it will just take longer. At its core, deduplication is one of the things Hadoop does naturally, and you get most of the functionality for free: Hadoop hashes all your "keys" in the Map step and ensures that all "values" belonging to a "key" end up on the same Reducer.

The task itself is fairly simple; in fact it is almost the same as the WordCount example (one of the simplest Map/Reduce jobs). Just skip outputting the count and output only the key (use NullWritable for the value). I've included the map and reduce functions below. Note: if you are using N Reducers, you will need to concatenate the resulting N output files to get back a single file. Here is the code:

public void map(LongWritable key, Text value, Context context)
  throws IOException, InterruptedException {
     String line = value.toString(); // process your data here
     context.write(new Text(line), NullWritable.get()); // the output key must be a Writable, e.g. Text
 }


public void reduce(Text key, Iterable<NullWritable> values, Context context)
  throws IOException, InterruptedException {
     context.write(key, NullWritable.get()); // emit each distinct key exactly once
 }
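
To merge the Reducer outputs, one option (assuming the job wrote its output to /user/you/dedup-out on HDFS) is the getmerge shell command, which concatenates the part files into a single local file:

hadoop fs -getmerge /user/you/dedup-out deduped.txt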

Edit 1: If you want to use a Combiner as suggested by the other answer, you can do so very easily. A Combiner is run before the data is sent over the network; you can think of it as a local Reducer. Just set

job.setCombinerClass(Reduce.class);

where Reduce is your class containing the reduce() method.
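
For completeness, here is a minimal driver sketch showing where that call goes. The class names Dedup, Map and Reduce and the input/output paths are placeholders; it assumes the new org.apache.hadoop.mapreduce API, with Map and Reduce being the classes that hold the map() and reduce() methods above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Dedup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedup");
        job.setJarByClass(Dedup.class);

        job.setMapperClass(Map.class);      // extends Mapper<LongWritable, Text, Text, NullWritable>
        job.setCombinerClass(Reduce.class); // local reduce before the shuffle
        job.setReducerClass(Reduce.class);  // extends Reducer<Text, NullWritable, Text, NullWritable>

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}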


Edit 2: As per a suggestion I received: the value.toString() call is superfluous if you only have strings to deal with and do not need to do any processing at all. You could then simplify the Mapper a bit:

public void map(LongWritable key, Text value, Context context) 
  throws IOException, InterruptedException {
     context.write(value, NullWritable.get());
 }

Upvotes: 4

Judge Mental

Reputation: 5239

Do not forget that a Combiner is the single best way to reduce network traffic if you have lots and lots of duplicates: enough that a single host in the cluster will already hold many of them before the shuffle.

Upvotes: 0
