Reputation: 51
I'm new to Hadoop and MapReduce. Could someone clarify the difference between a combiner and an in-mapper combiner, or are they the same thing?
Upvotes: 5
Views: 8314
Reputation: 63259
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
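For reference, a standard combiner is wired up in the job driver, along these lines (a sketch; MyMapper and MyReducer are hypothetical classes from your own job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The combiner is just a Reducer that Hadoop may run on the map-side
// output before it is spilled and shuffled. The framework may invoke it
// zero, one, or several times, so the combining operation must be
// associative and commutative.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class); // often the Reducer class itself
job.setReducerClass(MyReducer.class);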
The in-mapper combiner takes this optimization a step further: the aggregated data never even touches local disk; it is accumulated in memory inside the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}
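Each map() call then buffers into that map rather than emitting directly, along these lines (a sketch; merge() is a hypothetical stand-in for whatever combining logic your job needs):

@Override
protected void map(LongWritable key, Text value, Context context) {
    // Hadoop reuses the key and value objects across map() calls,
    // so copy them before storing references in the in-memory map.
    LongWritable k = new LongWritable(key.get());
    Text copy = new Text(value);
    Text existing = inmemMap.get(k);
    // merge() is a hypothetical placeholder for your combining logic,
    // e.g. summing parsed numbers or concatenating strings.
    inmemMap.put(k, existing == null ? copy : merge(existing, copy));
}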
In other words, during each map() invocation you add values to the in-memory map instead of calling context.write() on each one. Finally, the MapReduce framework will automatically call:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // doAggregation() stands in for whatever combining logic your job needs
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Notice that instead of calling context.write() on every record, you add entries to the in-memory map. Then, in the cleanup() method, you call context.write() with the condensed/pre-aggregated results from that map. As a result, your local map output spill files (which the reducers will read) are much smaller.
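To make the pattern concrete, here is a minimal, self-contained word-count mapper using in-mapper combining (an illustrative sketch; the class name and tokenization are my own choices):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts are accumulated in an in-memory HashMap and emitted once, in
// cleanup(), instead of writing one (word, 1) pair per token.
public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the pre-aggregated counts once, at the end of the map task.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

One caveat: the in-memory map must fit in the mapper's heap, so for high-cardinality keys a common variant flushes the map (emitting its contents and clearing it) whenever it grows past a size threshold, rather than waiting for cleanup().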
In both cases - in-mapper combiner and standard combiner - the smaller map spill files mean less network traffic to the reducers, which in turn reduces the processing the reducers have to do.
Upvotes: 6