Kevin

Reputation: 53

Definitive source for when Hadoop MapReduce Runs a Combiner

There have been quite a few questions like this one already, with conflicting answers. I've also found conflicting statements in the literature and on blogs. The book Hadoop: The Definitive Guide says:

Hadoop does not provide a guarantee of how many times it will call [the combiner] for a particular map output record, if at all. In other words, calling the combiner function zero, one or many times should produce the same output from the reducer

The answers to a similar question, "On what basis mapreduce framework decides whether to launch a combiner or not", suggest that a combiner, if defined, will always be called once, because the MapOutputBuffer needs to be flushed.

There might be an edge case where the mapper emits only once, meaning the combiner, even if defined, won't run.

My question is this: Is there a definitive source for the answer to this question? I've searched the Hadoop documentation, of course, but can't find anything.

Upvotes: 0

Views: 90

Answers (1)

Pradeep Bhadani

Reputation: 4721

The Hadoop framework aims to give users and developers an easy interface for writing code that runs in a distributed environment, without having to think about or handle the complexity of distributed systems themselves.

To answer your question, you can read the source code, which contains the logic that decides whether to invoke the combiner.

Lines 1950-1955 of MapTask.java: https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java

 if (combinerRunner == null || numSpills < minSpillsForCombine) {
     Merger.writeFile(kvIter, writer, reporter, job);
 } else {
     combineCollector.setWriter(writer);
     combinerRunner.combine(kvIter, combineCollector);
 }

So, at this point in the code, the combiner won't run if:

  • it is not defined (combinerRunner is null), or
  • the number of spills is less than minSpillsForCombine. minSpillsForCombine is set by the property "mapreduce.map.combine.minspills", whose default value is 3.
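
As a rough illustration of the condition quoted above, here is a minimal standalone sketch in plain Java. It does not use any Hadoop classes: the class CombinerDecision and the method runsCombiner are hypothetical names invented for this example, and the method simply negates the `if` condition from the snippet. The value 3 is the default stated above, not something read from a real configuration.

```java
// Standalone sketch; CombinerDecision and runsCombiner are hypothetical
// names for illustration and are not part of Hadoop's API.
class CombinerDecision {

    // Mirrors the negation of the quoted condition:
    //   combinerRunner == null || numSpills < minSpillsForCombine
    // i.e. the combiner branch is taken only when a combiner is defined
    // AND the spill count has reached the configured minimum.
    static boolean runsCombiner(boolean combinerDefined,
                                int numSpills,
                                int minSpillsForCombine) {
        return combinerDefined && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        // Assumed default of mapreduce.map.combine.minspills, as stated above.
        int minSpills = 3;

        // No combiner defined: the combiner branch is never taken.
        System.out.println(runsCombiner(false, 5, minSpills));

        // Combiner defined, but fewer spills than the minimum.
        System.out.println(runsCombiner(true, 2, minSpills));

        // Combiner defined and the spill count reaches the minimum.
        System.out.println(runsCombiner(true, 3, minSpills));
    }
}
```

Under these assumptions, lowering the minimum (for example to 1) would make the third kind of case apply to any task that has a combiner and produces at least one spill at this point in the code.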

Since most Hadoop properties are configurable, both the behaviour and the performance depend on how you configure them.

Hope this answers your question.

Upvotes: 1
