user2715182

Reputation: 723

MapReduce with two input files, where one file is processed based on the other

I need to write a MapReduce job that takes two input files. The first input file looks like this:

key1 , 25
key1 , 35
key1 , 60
key2 , 30
key3 , 45
key3 , 65

The second input file is as follows:

key1, -10
key2, -20
key3, -15

and I need to get this output:

key1 , 15
key1 , 25
key1 , 50
key2 , 10
key3 , 30
key3 , 50

(The output is the first file's values reduced by the second file's values: for example, 25 and -10 give 15.)

How could this be done? What would the mapper and reducer tasks look like?

My approach is as follows:

I think I will need two mappers, one per input file (can a single mapper be used to read both files?). The mappers will simply emit the key and the value.

At the reducer end, once I have received all the values for a key, I have to subtract the value that came from the second file from each value that came from the first file.

So I need to find out whether a given value came from the first or the second input file. How can this be done?

Are there any better approaches?

Upvotes: 2

Views: 1837

Answers (2)

YoungHobbit

Reputation: 13402

This can be done in a single MapReduce program, using the MultipleInputs support in the MapReduce framework.

  • Define two mapper classes, one per input file. Each mapper emits a (key#fileName, value) pair.
  • Define a custom partitioner that considers only the actual key and ignores the appended fileName when partitioning the data, so that the same key from both files goes to the same reducer.
  • The reducer receives the list of values for a key from file1. Hold this list in memory, then fetch the list of values from file2 for the same key. The two arrive consecutively because the data is partitioned on the key part only and the comparator sorts on the full key, assuming the first file's name sorts first alphabetically. Then apply the required operation to the first file's values using the second file's value.

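    // Driver setup: one mapper class per input path.
    // MultipleInputs lives in org.apache.hadoop.mapreduce.lib.input.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "aggprog");

    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MapperOne.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MapperTwo.class);

    // The partitioner is set on the Job, not on the Configuration.
    job.setPartitionerClass(CustomPartitioner.class);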
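
To make the partitioning and reducer steps concrete, here is a rough sketch in plain Java with no Hadoop dependency. The class name CompositeKeyJoinSketch, the "file1"/"file2" suffixes and the method names are made up for illustration, and the loop over a sorted map only simulates the consecutive reduce calls one reducer would see. In a real job the two composite keys arrive as separate reduce() calls unless a grouping comparator is configured, so the reducer would have to keep the held list in a field between calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompositeKeyJoinSketch {

    // Partition on the natural key only, ignoring the "#fileName" suffix,
    // so "key1#file1" and "key1#file2" get the same reducer index.
    static int partition(String compositeKey, int numReducers) {
        String naturalKey = compositeKey.substring(0, compositeKey.indexOf('#'));
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulates one reducer: groups are visited in sorted composite-key order,
    // so "key1#file1" is immediately followed by "key1#file2".
    static List<String> reduce(TreeMap<String, List<Integer>> sortedGroups) {
        List<String> out = new ArrayList<>();
        String heldKey = null;
        List<Integer> held = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : sortedGroups.entrySet()) {
            String[] parts = e.getKey().split("#", 2);
            if (parts[1].equals("file1")) {
                // Hold the first file's values until the second file's group arrives.
                heldKey = parts[0];
                held = e.getValue();
            } else if (parts[0].equals(heldKey)) {
                // Assumption: the second file has exactly one value per key.
                int adjustment = e.getValue().get(0);
                for (int v : held) {
                    out.add(heldKey + " , " + (v + adjustment));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        groups.put("key1#file1", Arrays.asList(25, 35, 60));
        groups.put("key1#file2", Arrays.asList(-10));
        groups.put("key2#file1", Arrays.asList(30));
        groups.put("key2#file2", Arrays.asList(-20));
        for (String line : reduce(groups)) {
            System.out.println(line);
        }
    }
}
```

Hadoop does not guarantee the order of values within a key, so the order of the output lines for one key may differ in a real job.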
    

Hope this helps.

Upvotes: 1

mattinbits

Reputation: 10428

Read each file in a separate mapper, and tag each record so that you know which file it came from, e.g. output:

key1 , 25 , file1
key1 , 35 , file1
key1 , 60 , file1
key2 , 30 , file1
key3 , 45 , file1
key3 , 65 , file1

key1, -10 , file2
key2, -20 , file2
key3, -15 , file2

Then you can pass both outputs together through a single MapReduce phase. You will know which record came from where, and you can manipulate your data accordingly in your reducer.
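
As a rough illustration of the tagging idea, the sketch below uses plain Java rather than the Hadoop API. The class name TaggedValueJoinSketch, the "value,tag" encoding and the "file1"/"file2" tags are assumptions made for this example. It also assumes the second file has one value per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TaggedValueJoinSketch {

    // Map step: turn a line such as "key1 , 25" into {"key1", "25,file1"}.
    static String[] map(String line, String fileTag) {
        String[] parts = line.split(",");
        return new String[] { parts[0].trim(), parts[1].trim() + "," + fileTag };
    }

    // Reduce step for one key: the tagged values may arrive in any order,
    // so collect the first file's values and remember the second file's value.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<Integer> fromFile1 = new ArrayList<>();
        int adjustment = 0;
        for (String tagged : taggedValues) {
            String[] parts = tagged.split(",");
            int value = Integer.parseInt(parts[0]);
            if (parts[1].equals("file2")) {
                adjustment = value;
            } else {
                fromFile1.add(value);
            }
        }
        List<String> out = new ArrayList<>();
        for (int v : fromFile1) {
            out.add(key + " , " + (v + adjustment));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList(
            map("key1 , 25", "file1")[1],
            map("key1, -10", "file2")[1],
            map("key1 , 35", "file1")[1]);
        for (String line : reduce("key1", tagged)) {
            System.out.println(line);
        }
    }
}
```

Because the tag travels with the value, the key stays unchanged and the default partitioning already sends both files' records for a key to the same reducer.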

Upvotes: 1
