rapidclock

Reputation: 1717

Running Code after all reducers run

I am trying to write a MapReduce program that computes an average of some statistics.

Each mapper reads the data in its respective segment and applies some filters.

I am using multiple reducers.

Therefore each reducer can only calculate the local average of its own partition. However, I need the average over all the data going to all the reducers. How do I pull this off?

One idea is to use global counters to hold the sum and count. But I need a segment of code that runs after all the reducers have finished (so that I can operate on the final sum and count) and writes the average to a file. Is this a viable approach, and how can I do it?

Also note that I have to use multiple reducers, so the option of having just one reducer and doing the average computation in its cleanup method is out the window.

Upvotes: 0

Views: 565

Answers (2)

ViKiG

Reputation: 783

If you insist on using multiple reducers for this job, then I guess you should chain multiple jobs (in your case, two). The first job does whatever you have right now; the second job is set up to calculate the overall average, with the first job's output going in as the second job's input.

You can see my answer here for how to set up a chain of jobs in a single driver class.
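
For reference, a minimal sketch of such a driver. This is only one way to wire it up; the class names (FilterMapper, PartialSumReducer, GlobalAverageReducer) are placeholders for whatever you have, not code from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AverageDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Job 1: your current job - filter the data, then emit
            // partial sums/counts from several reducers.
            Job job1 = Job.getInstance(conf, "partial sums");
            job1.setJarByClass(AverageDriver.class);
            job1.setMapperClass(FilterMapper.class);       // your existing mapper
            job1.setReducerClass(PartialSumReducer.class); // emits sum + count
            job1.setNumReduceTasks(4);                     // multiple reducers
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job1, new Path(args[0]));
            FileOutputFormat.setOutputPath(job1, new Path(args[1]));

            // Block until job 1 finishes; abort the chain if it failed.
            if (!job1.waitForCompletion(true)) {
                System.exit(1);
            }

            // Job 2: identity mappers feed a single reducer that merges
            // the partial sums/counts into the overall average.
            Job job2 = Job.getInstance(conf, "overall average");
            job2.setJarByClass(AverageDriver.class);
            // Job 1 wrote key<TAB>value text, which this input format parses back.
            job2.setInputFormatClass(KeyValueTextInputFormat.class);
            job2.setMapperClass(Mapper.class); // the base Mapper is an identity map
            job2.setReducerClass(GlobalAverageReducer.class);
            job2.setNumReduceTasks(1);
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job2, new Path(args[1]));
            FileOutputFormat.setOutputPath(job2, new Path(args[2]));

            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }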

Upvotes: 0

RojoSam

Reputation: 1496

Option 1.- Implement a combiner and use only one reducer. The combiner will reduce the amount of data transferred to the reducer. If the reason for using more than one reducer is the volume of data you are processing, this could be an option.

Option 2.- Inside each mapper, compute the partial sum/count in memory and write only those aggregated values to the output in the cleanup method. This allows you to use a single reducer to compute the final average (see the sketch after this list).

Option 3.- Implement your process as two MapReduce jobs: one to calculate a partial sum/count in each reducer, and then a second job with identity mappers and only one reducer to compute the average.

Option 4.- Use counters and, as @Thomas suggests, implement the logic after waitForCompletion() returns (see the second sketch, at the end of this answer).

Option 5.- Read the reducers' output files from HDFS and compute the average from them (using counters would probably be simpler).
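
Since Option 2 is the one I would pick, here is a minimal sketch of it. It assumes each input line holds one numeric value, and the key "avg", the tab-separated value format and the class names are all illustrative; adapt the parsing and filtering to your records:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each mapper aggregates in memory and emits a single (sum, count)
    // record from cleanup(), so the lone reducer only sees one record
    // per mapper instead of the whole data set.
    class PartialSumMapper extends Mapper<LongWritable, Text, Text, Text> {
        private double sum = 0;
        private long count = 0;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // ... apply your filters here, then aggregate what passes ...
            sum += Double.parseDouble(value.toString());
            count++;
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            context.write(new Text("avg"), new Text(sum + "\t" + count));
        }
    }

    // The single reducer merges the per-mapper partials into one average.
    class AverageReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                sum += Double.parseDouble(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }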

In my opinion, Option 2 is the simplest and cleanest to implement. Option 1 is the most generic, useful if you need to calculate more than one average at the same time and keeping the sums/counts in memory is not possible (counters are more restrictive: you only get a few thousand of them at most).
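
And a sketch of Option 4 as well, since it is closest to what the question asks for. Keep in mind that counters hold long values only, so a fractional sum has to be scaled or rounded; the Stats enum, the scale factor of 1000 and the output path are all illustrative:

    // Needs org.apache.hadoop.mapreduce.Counters and
    // org.apache.hadoop.fs.{FileSystem, FSDataOutputStream, Path}.

    // A counter group shared by the tasks and the driver:
    enum Stats { SUM, COUNT }

    // Inside each reducer, fold the local results into the global counters,
    // e.g. in cleanup(). Counters are longs, so scale fractional sums:
    //   context.getCounter(Stats.SUM).increment(Math.round(localSum * 1000));
    //   context.getCounter(Stats.COUNT).increment(localCount);

    // In the driver: once waitForCompletion() returns, the framework has
    // aggregated the counters of every task, so this is effectively the
    // "code that runs after all reducers" the question asks for.
    if (job.waitForCompletion(true)) {
        Counters counters = job.getCounters();
        double sum = counters.findCounter(Stats.SUM).getValue() / 1000.0;
        long count = counters.findCounter(Stats.COUNT).getValue();

        // Write the final average to a small HDFS file.
        FileSystem fs = FileSystem.get(job.getConfiguration());
        try (FSDataOutputStream out = fs.create(new Path("/output/average.txt"))) {
            out.writeBytes(Double.toString(sum / count));
        }
    }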

Upvotes: 1
