Reputation: 35
Could anyone help me write a mapper and reducer to merge these two files and then remove the duplicate records?
These are the two text files:
file1.txt
2012-3-1a
2012-3-2b
2012-3-3c
2012-3-4d
2012-3-5a
2012-3-6b
2012-3-7c
2012-3-3c
and file2.txt:
2012-3-1b
2012-3-2a
2012-3-3b
2012-3-4d
2012-3-5a
2012-3-6c
2012-3-7d
2012-3-3c
Upvotes: 0
Views: 2690
Reputation: 124
Here's code that removes duplicate lines from large text data; for efficiency it uses a hash of each line as the map output key:
DRMapper.java
import com.google.common.hash.Hashing;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

class DRMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text hashKey = new Text();
    private Text mappedValue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Use a MurmurHash3 of the whole line as the key so that identical
        // lines from either file end up in the same reducer group.
        hashKey.set(Hashing.murmur3_32().hashString(line, StandardCharsets.UTF_8).toString());
        mappedValue.set(line);
        context.write(hashKey, mappedValue);
    }
}
DRReducer.java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DRReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All identical lines share the same hash key, so writing only the
        // first value of each group drops the duplicates.
        Text value;
        if (values.iterator().hasNext()) {
            value = values.iterator().next();
            if (!(value.toString().isEmpty())) {
                context.write(value, NullWritable.get());
            }
        }
    }
}
DuplicateRemover.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateRemover {

    private static final int DEFAULT_NUM_REDUCERS = 210;

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: DuplicateRemover <input path> <output path>");
            System.exit(-1);
        }
        // Job.getInstance() replaces the deprecated new Job() constructor.
        Job job = Job.getInstance();
        job.setJarByClass(DuplicateRemover.class);
        job.setJobName("Duplicate Remover");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DRMapper.class);
        job.setReducerClass(DRReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setNumReduceTasks(DEFAULT_NUM_REDUCERS);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile with:
javac -encoding UTF8 -cp $(hadoop classpath) *.java
jar cf dr.jar *.class
Assuming that the input text files are in in_folder, run as:
hadoop jar dr.jar in_folder out_folder
Upvotes: 0
Reputation: 141
A simple word count program will do the job for you. The only change you need to make is to set the reducer's output value to NullWritable.get(); each distinct line then comes out exactly once as the key. A minimal sketch follows below.
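To make that concrete, here is a minimal sketch of the idea (the class, file, and job names are mine, not from the answer): the mapper emits each input line as the key with a NullWritable value, and the reducer writes each distinct key exactly once.
LineDedup.java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineDedup {

    // Mapper: the whole record is the key, the value carries no information.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(record, NullWritable.get());
        }
    }

    // Reducer: identical records are grouped under one key, so writing the
    // key once per group removes the duplicates.
    public static class LineReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(LineDedup.class);
        job.setJobName("Line Dedup");
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. in_folder with file1.txt, file2.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. out_folder
        job.setMapperClass(LineMapper.class);
        // The reducer can also act as a combiner, shrinking map output early.
        job.setCombinerClass(LineReducer.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
It is compiled and run the same way as the code in the other answer, with the input and output folders passed as the two arguments.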
Upvotes: 2
Reputation: 23
Is there a common key in both files that identifies whether two records match? If so: the mapper's input is the standard TextInputFormat, its output key is the common key, and its output value is the entire record. At the reducer there is no need to iterate over the values; just write a single instance of the value for each key (see the sketch below).
If duplicates can only be detected when the complete record matches: the mapper's input is again the standard TextInputFormat, its output key is the entire record, and its output value is NullWritable. At the reducer there is again no need to iterate; take one instance of the key and write it out, with the reducer's output key being its input key and its output value being NullWritable.
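As an illustration of the first case (match on a common key), here is a minimal sketch. The key-extraction rule, which takes the date prefix to be everything before the record's last character, is purely an assumption made to fit the sample data in the question, as are the class names; the driver is omitted and would be wired up like DuplicateRemover above, with map output classes Text/Text and job output classes Text/NullWritable.
KeyDedup.java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Case 1: records count as duplicates when they share a common key.
public class KeyDedup {

    public static class KeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String line = record.toString();
            if (line.isEmpty()) {
                return;
            }
            // Assumed key: the date prefix, i.e. the record minus its last character.
            outKey.set(line.substring(0, line.length() - 1));
            context.write(outKey, record);   // key = common key, value = entire record
        }
    }

    public static class KeyReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            // No need to iterate over the whole group: keep only the first
            // record seen for this key and drop the rest.
            context.write(records.iterator().next(), NullWritable.get());
        }
    }
}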
Upvotes: 0