Reputation: 51
I'm new to Hadoop and MapReduce. Could someone clarify the difference between a combiner and an in-mapper combiner, or are they the same thing?
Upvotes: 5
Views: 8314
Reputation: 63259
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
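For reference, a standard combiner is wired up in the job driver, along these lines (a sketch; MyMapper and MyReducer are hypothetical classes from your own job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The combiner is just a Reducer that Hadoop may run on the map-side
// output before it is spilled and shuffled. The framework may invoke it
// zero, one, or several times, so the combining operation must be
// associative and commutative.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class); // often the Reducer class itself
job.setReducerClass(MyReducer.class);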
The in-mapper combiner takes this optimization a step further: the aggregated data never even touches local disk; it is accumulated in memory inside the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}
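Each map() call then buffers into that map rather than emitting directly, along these lines (a sketch; merge() is a hypothetical stand-in for whatever combining logic your job needs):

@Override
protected void map(LongWritable key, Text value, Context context) {
    // Hadoop reuses the key and value objects across map() calls,
    // so copy them before storing references in the in-memory map.
    LongWritable k = new LongWritable(key.get());
    Text copy = new Text(value);
    Text existing = inmemMap.get(k);
    // merge() is a hypothetical placeholder for your combining logic,
    // e.g. summing parsed numbers or concatenating strings.
    inmemMap.put(k, existing == null ? copy : merge(existing, copy));
}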
In other words, during each map() invocation you add values to the in-memory map instead of calling context.write() on each one. Finally, the MapReduce framework will automatically call:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // doAggregation() stands in for whatever combining logic your job needs
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Notice that instead of calling context.write() on every record, you add entries to the in-memory map. Then, in the cleanup() method, you call context.write() with the condensed/pre-aggregated results from that map. As a result, your local map output spill files (which the reducers will read) are much smaller.
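To make the pattern concrete, here is a minimal, self-contained word-count mapper using in-mapper combining (an illustrative sketch; the class name and tokenization are my own choices):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts are accumulated in an in-memory HashMap and emitted once, in
// cleanup(), instead of writing one (word, 1) pair per token.
public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the pre-aggregated counts once, at the end of the map task.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

One caveat: the in-memory map must fit in the mapper's heap, so for high-cardinality keys a common variant flushes the map (emitting its contents and clearing it) whenever it grows past a size threshold, rather than waiting for cleanup().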
In both cases - in-mapper combiner and standard combiner - the smaller map spill files mean less network traffic to the reducers, which in turn reduces the processing the reducers have to do.
Upvotes: 6