Reputation: 24399
MongoDB's explanation of the reduce phase says:
The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.
This is how I always understood reduce to work in a general map reduce environment. Here you could sum values across N machines by reducing the values on each machine, then sending those outputs to another reducer.
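For example (just an illustrative sketch, not actual MongoDB code; the names and numbers are made up), a sum written this way gives the same result whether it sees raw values or the outputs of earlier reduce calls:

    // reduce written so it can be applied to its own outputs
    var reduceSum = function (key, values) {
        return values.reduce(function (a, b) { return a + b; }, 0);
    };

    // each machine reduces its local values...
    var machineA = reduceSum("k", [1, 2, 3]);          // 6
    var machineB = reduceSum("k", [4, 5]);             // 9
    // ...and another reducer sums the partial results
    var total = reduceSum("k", [machineA, machineB]);  // 15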
Wikipedia's description of MapReduce, on the other hand, says:
The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.
Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.
Is Wikipedia's description too specific? Or did MongoDB break map-reduce? (Or am I missing something here?)
Upvotes: 5
Views: 804
Reputation: 8930
TL;DR: reduce (mongo) is like the combiner, and finalize (mongo) is almost like the reducer, except that it takes just one key/value pair. If you need to have all your data for a key in the reduce (hadoop) function, aggregate it with the reduce (mongo) into a big array and pass it to finalize. Use some sort of flag in the output values so you can tell raw values apart from already-aggregated ones.
That's how I do it, and I think it would suck for big loads of data, but I don't know of any other way to do it with MongoDB's map-reduce :( (but I'm not very experienced with it)
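A rough sketch of that pattern (purely illustrative; the db.orders collection and the userId/amount fields are made up):

    // map already emits values in the same shape reduce produces, which plays the
    // role of the "flag": reduce never has to tell raw values from merged ones
    var map = function () {
        emit(this.userId, { vals: [this.amount] });
    };

    // reduce (mongo) acts like a combiner: it only gathers values into one big
    // array, so re-invoking it on its own output changes nothing
    var reduce = function (key, values) {
        var merged = { vals: [] };
        values.forEach(function (v) {
            merged.vals = merged.vals.concat(v.vals);
        });
        return merged;
    };

    // finalize plays the role of the Hadoop reducer: it sees the complete array
    // for the key exactly once and does the real aggregation
    var finalize = function (key, reduced) {
        var sum = 0;
        reduced.vals.forEach(function (v) { sum += v; });
        return sum;
    };

    db.orders.mapReduce(map, reduce, { finalize: finalize, out: { inline: 1 } });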
Upvotes: 0
Reputation: 33495
According to the Google MapReduce paper:
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
The MongoDB documentation says:
The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.
So, in the case of MapReduce as defined in the Google paper, the reduce starts processing the key/value pairs only once all the data for a particular key has been transferred to the reducer. But, as Tomasz mentioned, MongoDB seems to implement MapReduce in a slightly different way.
In the MapReduce proposed by Google, at any given time either the Map or the Reduce tasks are processing the KV pairs, but in the MongoDB implementation the Map and Reduce tasks process the KV pairs simultaneously. The MongoDB approach might not be efficient, since the nodes are not used efficiently and there is a chance that the Map and Reduce slots in the cluster are full and cannot run new jobs.
The catch in Hadoop is that although the reduce tasks don't process the KV pairs until the maps are done processing the data, the reduce tasks can be spawned before the mappers complete the processing. The parameter "mapreduce.job.reduce.slowstart.completedmaps" is set to "0.05" by default, and its description says: "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."
Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.
Also, data locality is considered only for the map tasks and not for the reduce tasks. For the reduce tasks, the data has to be moved from the different mappers on different nodes to the reducers for aggregation.
Just my 2c.
Upvotes: 2
Reputation: 340708
This is how the original Map Reduce framework was described by Google:
2 Programming Model
[...]
The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
And later:
3 Implementation
[...]
6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.
So there is only one invocation of Reduce per key. The problem of moving a lot of small intermediate pairs is addressed by using a special combiner function locally:
4.3 Combiner Function
In some cases, there is significant repetition in the intermediate keys produced by each map task [...] We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network. The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. [...] Partial combining significantly speeds up certain classes of MapReduce operations.
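As a rough illustration of that (plain JavaScript, not Hadoop or MongoDB code; all names and data are made up), the same summing function can serve as both the combiner on each map machine and the final Reduce:

    // toy word count simulating the paper's pipeline:
    // map -> per-map-task combiner -> shuffle by key -> one Reduce call per key
    var documents = ["a b a", "b a"];

    var mapFn = function (doc) {
        return doc.split(" ").map(function (w) { return [w, 1]; });
    };

    // the same code implements both the combiner and the reduce function
    var sumFn = function (key, values) {
        return values.reduce(function (a, b) { return a + b; }, 0);
    };

    // combiner: partial merging on each machine that performed a map task
    var combined = documents.map(function (doc) {
        var local = {};
        mapFn(doc).forEach(function (kv) {
            local[kv[0]] = (local[kv[0]] || []).concat(kv[1]);
        });
        return Object.keys(local).map(function (k) {
            return [k, sumFn(k, local[k])];
        });
    });

    // shuffle: group the already-combined pairs by key across machines
    var grouped = {};
    combined.forEach(function (pairs) {
        pairs.forEach(function (kv) {
            grouped[kv[0]] = (grouped[kv[0]] || []).concat(kv[1]);
        });
    });

    // a single Reduce invocation per unique key
    Object.keys(grouped).forEach(function (k) {
        console.log(k, sumFn(k, grouped[k]));   // a 3, b 2
    });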
Wikipedia follows the original MapReduce design, while MongoDB's designers have taken a slightly different approach.
Upvotes: 4