HHH

Reputation: 6485

How to force a MapReduce program to execute the combiner?

I'm working on a MapReduce program in which I need to insert entities into a database. Due to a performance issue, inserting the entities into the database should be done in the combiner. My program has no reducer, so there is only a mapper and a combiner. Since the Hadoop framework may not execute the combiner (the combiner is optional), how can I force it to run?

Upvotes: 1

Views: 595

Answers (1)

Chris Nauroth

Reputation: 9844

The MapReduce framework provides no supported way to force execution of the combiner. The combiner may be called 0, 1 or multiple times. The framework is free to make its own decisions about this.

The current implementation decides whether to run the combiner based on spills to disk occurring during map task execution. The Apache Hadoop documentation for mapred-default.xml documents several configuration properties that affect spill activity.

<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a
  thread will begin to spill the contents to disk in the background. Note that
  collection will not block if this threshold is exceeded while a spill is
  already in progress, so spills may be larger than this threshold when it is
  set to less than .5</description>
</property>

<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
  <description>The number of streams to merge at once while sorting
  files.  This determines the number of open file handles.</description>
</property>

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting 
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>

Additionally, there is an undocumented configuration property, mapreduce.map.combine.minspills, which defines the minimum number of spills required before running the combiner. The default value is 3 if unspecified.
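Purely for illustration, such an override could be expressed in the same format as the mapred-default.xml excerpts above (the value shown is hypothetical, and the property is undocumented and implementation-specific):

```xml
<!-- Hypothetical override, e.g. in the job configuration:
     attempt to run the combiner after a single spill. -->
<property>
  <name>mapreduce.map.combine.minspills</name>
  <value>1</value>
</property>
```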

It's possible that one could tune these configuration properties just right to create the conditions for enough spills to exceed mapreduce.map.combine.minspills, and therefore guarantee at least one call to the combiner. However, I can't recommend that, because it would be very brittle. The logic would be extremely sensitive to external factors, such as the size of the input data. It would also rely on implementation details of the current MapReduce codebase; internal algorithms are subject to change, and those changes could break your assumptions. There is effectively no public API for forcing a run of the combiner.

Additionally, keep in mind that unlike a reducer, the combiner might not see a complete picture of all values associated with a particular key. If multiple map tasks process records with the same key, then the reducer is the only place guaranteed to see all of those values grouped together. Even within a single map task, the combiner may execute multiple times, each time on a different subset of the key's values from the input split being processed.
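This is why combiner logic must be safe to apply to arbitrary partial groupings of a key's values: the operation needs to be associative and commutative so the partial results can be merged again later. A minimal plain-Java sketch (no Hadoop dependencies; the spill grouping is simulated) of why summation satisfies this:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSemantics {
    // The combine/reduce operation: summing partial counts. Because
    // addition is associative and commutative, applying it to arbitrary
    // partial groups of a key's values does not change the final result.
    static long sum(List<Long> values) {
        return values.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // All values emitted for one key within a map task.
        List<Long> all = Arrays.asList(1L, 1L, 1L, 1L, 1L);

        // Path 1: no combiner -- the reducer sees every raw value.
        long direct = sum(all);

        // Path 2: the combiner runs twice on different subsets (as it
        // might across two spills), and the partial sums are merged.
        long partial1 = sum(all.subList(0, 2)); // first spill
        long partial2 = sum(all.subList(2, 5)); // second spill
        long combined = sum(Arrays.asList(partial1, partial2));

        System.out.println(direct + " " + combined); // both paths agree
    }
}
```

An operation without these properties (for example, computing an average directly) would produce different answers depending on how many times the combiner happened to run.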

For a more standard solution to the problem of exporting data from Hadoop to a relational database, consider DBOutputFormat or Sqoop.
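For DBOutputFormat, the job wiring might look roughly like the following configuration sketch (the JDBC driver, connection URL, credentials, table, and column names are all hypothetical placeholders; the output key class must implement DBWritable):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbExportDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Database connection settings (all values hypothetical).
        DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost:3306/mydb",
            "dbuser", "dbpassword");

        Job job = Job.getInstance(conf, "export-entities");
        job.setJarByClass(DbExportDriver.class);
        job.setOutputFormatClass(DBOutputFormat.class);
        // Target table and columns; the job's output key must implement
        // DBWritable so the framework can bind it to the INSERT statement.
        DBOutputFormat.setOutput(job, "entities", "id", "name", "value");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This keeps the database writes in the framework's output path, where retries and task failures are handled for you, instead of in a combiner whose execution is not guaranteed.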

Upvotes: 2
