Ishan Bhatt

Reputation: 3

use output of a reducer in another mapper

I am developing a MapReduce application in which I need the month-start and month-end dates (not necessarily the first or last calendar date of the month, since those can fall on holidays or weekends). So I extract the month as the key and the corresponding date as the value, so that the data is aggregated month-wise and I can find the max and min dates. Based on these dates, I then need to use other attributes of the file. So I want to direct the output of one reducer into another mapper. This second mapper would also take the file as input, so I could compare the dates and process the data accordingly. Is there any way I can do this?

Upvotes: 0

Views: 1060

Answers (1)

Jeremy Beard

Reputation: 2725

At a high level one way to approach this would be to implement two MapReduce jobs that you run one after the other:

Job 1 takes the input data set and outputs key-value pairs for the start and end dates of each month to a single file by using a single reducer. This output file will be very small. This could be executed similarly to:

hadoop jar yourjob.jar YourFirstDriverClass /path/to/input /path/to/kvp/output
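The core of the Job 1 reducer is just a min/max over all dates seen for one month key. A minimal sketch of that logic as plain Java (the class and method names here are hypothetical, and ISO `yyyy-MM-dd` dates are assumed so that string comparison orders them correctly):

```java
import java.util.Arrays;
import java.util.List;

public class MonthBoundaries {

    // Given every date observed for one month key, return the earliest
    // and latest working dates. ISO yyyy-MM-dd strings compare correctly
    // with plain lexicographic comparison.
    public static String[] minMaxDates(List<String> dates) {
        String min = dates.get(0);
        String max = dates.get(0);
        for (String d : dates) {
            if (d.compareTo(min) < 0) min = d;
            if (d.compareTo(max) > 0) max = d;
        }
        return new String[] { min, max };
    }

    public static void main(String[] args) {
        // In the real reducer you would write these out as key-value pairs,
        // e.g. key "2015-03", value "2015-03-02,2015-03-31".
        String[] bounds = minMaxDates(
                Arrays.asList("2015-03-16", "2015-03-02", "2015-03-31"));
        System.out.println(bounds[0] + "," + bounds[1]);
    }
}
```

In the actual Reducer subclass this loop runs over the `Iterable` of values for each month key, and with a single reducer all months land in one small output file.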

Job 2 takes the same input data set, plus the path of the month dates file, and outputs the result of your processing. The month dates file is small enough that it can be opened and loaded into memory in the setup() call of each mapper or reducer. This could be executed similarly to:

hadoop jar yourjob.jar YourSecondDriverClass /path/to/input /path/to/kvp/output /path/to/final/output

In your driver main() you could pass the reference to the month dates file to the mappers and reducers similarly to:

getConf().set("month.dates.file", args[1]);

In your mapper or reducer setup() you can then load the data from the month dates file similarly to:

Configuration conf = context.getConfiguration();
Path path = new Path(conf.get("month.dates.file"));
FileSystem fs = FileSystem.get(conf);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
try {
    String line = br.readLine();
    while (line != null) {
        // Read your month dates from line into a data structure, e.g. a Map
        line = br.readLine();
    }
} finally {
    br.close();
}

With your month dates loaded into a data structure in the mapper or reducer class you can then access them for each call of map() or reduce(), and process your input data accordingly.
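For the per-record check in map(), one option is to keep the loaded dates in a map from month key to its start/end pair and test each record's date against it. A minimal sketch, assuming the hypothetical `monthDates` structure and ISO `yyyy-MM-dd` record dates:

```java
import java.util.HashMap;
import java.util.Map;

public class MonthBoundaryFilter {

    // monthDates maps a month key (e.g. "2015-03") to its
    // {startDate, endDate} pair, as loaded in setup().
    public static boolean isBoundaryDate(Map<String, String[]> monthDates,
                                         String date) {
        String month = date.substring(0, 7); // "2015-03" from "2015-03-02"
        String[] bounds = monthDates.get(month);
        if (bounds == null) {
            return false;
        }
        return date.equals(bounds[0]) || date.equals(bounds[1]);
    }

    public static void main(String[] args) {
        Map<String, String[]> monthDates = new HashMap<>();
        monthDates.put("2015-03", new String[] { "2015-03-02", "2015-03-31" });
        System.out.println(isBoundaryDate(monthDates, "2015-03-02")); // start date
        System.out.println(isBoundaryDate(monthDates, "2015-03-16")); // mid-month date
    }
}
```

Inside map() you would call a check like this on each input record and only emit (or specially process) the records whose dates match the month boundaries.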

This is obviously reasonably complex for what you are trying to do, and is a good example of why MapReduce abstractions such as Apache Hive, Apache Pig, and Apache Crunch are popular for implementing jobs with much less code.

Upvotes: 1
