Reputation: 53
There have been quite a few questions like this one already, with conflicting answers. I've also found conflicting statements in the literature and on blogs. The book Hadoop: The Definitive Guide says:
Hadoop does not provide a guarantee of how many times it will call [the combiner] for a particular map output record, if at all. In other words, calling the combiner function zero, one or many times should produce the same output from the reducer
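To make the quoted requirement concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of a combiner that is safe to call zero, one or many times. The class and method names are hypothetical, and summing per key is just an assumed example of an associative, commutative operation; it is not taken from the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerInvariance {
    // Reducer: sums all values seen for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            totals.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return totals;
    }

    // Combiner: same summing logic, but it emits records again so that the
    // reducer (or another combiner pass) can consume them.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> records) {
        return new ArrayList<>(reduce(records).entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));

        // Combiner applied zero, one and two times: the reducer output is the same.
        System.out.println(reduce(mapOutput));                   // {a=4, b=2}
        System.out.println(reduce(combine(mapOutput)));          // {a=4, b=2}
        System.out.println(reduce(combine(combine(mapOutput)))); // {a=4, b=2}
    }
}
```

A combiner that computed, say, an average of the values would not pass this check, which is why the number of combiner invocations must not matter for correctness.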
The answers to a similar question, "On what basis mapreduce framework decides whether to launch a combiner or not", suggest that a combiner, if defined, will always be called once, because the MapOutputBuffer needs to be flushed.
There might be an edge case where the mapper emits only once, meaning the combiner, even if defined, won't run.
My question is this: Is there a definitive source for the answer to this question? I've searched the Hadoop documentation, of course, but can't find anything.
Upvotes: 0
Views: 90
Reputation: 4721
The Hadoop framework aims to provide an easy interface for developing code that runs in a distributed environment, without the developer having to think about or handle the complexity of distributed systems.
To answer your question, you can read the source code, which contains the logic that decides whether to invoke the combiner:
    if (combinerRunner == null || numSpills < minSpillsForCombine) {
        Merger.writeFile(kvIter, writer, reporter, job);
    } else {
        combineCollector.setWriter(writer);
        combinerRunner.combine(kvIter, combineCollector);
    }
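So, in this code path, the combiner won't run if:
1. no combiner is defined (combinerRunner == null), or
2. the number of spills is less than minSpillsForCombine, which is read from the mapreduce.map.combine.minspills property (default 3).
Since most Hadoop properties are configurable, the behaviour and performance depend on how you configure them.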
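The branch quoted above can be reduced to a small predicate, which may make the two cases easier to see. This is only a sketch of that one condition: the class and method names are hypothetical, the threshold of 3 is an example value that you should check against your own configuration, and it says nothing about other places where Hadoop may invoke the combiner.

```java
class CombinerAtMerge {
    // Negation of the quoted condition:
    //   combinerRunner == null || numSpills < minSpillsForCombine
    // Returns true when the else-branch (the one that calls combine) is taken.
    static boolean combinerRunsDuringMerge(boolean combinerDefined,
                                           int numSpills,
                                           int minSpillsForCombine) {
        return combinerDefined && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        int minSpills = 3; // example threshold, assumed for illustration

        System.out.println(combinerRunsDuringMerge(false, 5, minSpills)); // false: no combiner defined
        System.out.println(combinerRunsDuringMerge(true, 1, minSpills));  // false: too few spills
        System.out.println(combinerRunsDuringMerge(true, 3, minSpills));  // true: threshold reached
    }
}
```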
Hope this answers your question.
Upvotes: 1