Mark Bolusmjak

Reputation: 24399

Is Wikipedia's explanation of Map Reduce's reduce incorrect?

MongoDB's explanation of the reduce phase says:

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

This is how I always understood reduce to work in a general map reduce environment. Here you could sum values across N machines by reducing the values on each machine, then sending those outputs to another reducer.
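A minimal Python sketch of that re-reducible model (the split across two "machines" and the data are invented for illustration; MongoDB's actual reduce functions are written in JavaScript):

```python
# Sketch of a re-reducible sum, in the spirit of MongoDB's reduce.
# The reduce function can be applied to the outputs of earlier reduce calls,
# so each machine reduces its own values first and only ships a partial sum.

def reduce_sum(key, values):
    # Works the same whether `values` are raw emits or earlier reduce outputs.
    return sum(values)

# Raw (key, value) emits, split across two hypothetical machines.
machine_a = [("clicks", 1), ("clicks", 1), ("clicks", 1)]
machine_b = [("clicks", 1), ("clicks", 1)]

# First pass: each machine reduces locally.
partial_a = reduce_sum("clicks", [v for _, v in machine_a])   # 3
partial_b = reduce_sum("clicks", [v for _, v in machine_b])   # 2

# Second pass: another reducer combines the partial outputs.
total = reduce_sum("clicks", [partial_a, partial_b])          # 5
print(total)
```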

Wikipedia says:

The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.

Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.
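For contrast, a minimal Python sketch of the model Wikipedia describes, again with invented data: the framework first brings every value for a key together, then calls Reduce exactly once per key:

```python
from collections import defaultdict

def reduce_once(key, values):
    # Called exactly once per key, with every value for that key available.
    return sum(values)

# All intermediate pairs, from every mapper, shuffled to the reduce side.
intermediate = [("clicks", 1), ("views", 1), ("clicks", 1),
                ("clicks", 1), ("views", 1)]

grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)          # all values for a key end up together

results = {key: reduce_once(key, vals) for key, vals in sorted(grouped.items())}
print(results)                          # {'clicks': 3, 'views': 2}
```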

Is Wikipedia's description too specific? Or did MongoDB break map-reduce? (Or am I missing something here?)

Upvotes: 5

Views: 804

Answers (3)

Thomas

Reputation: 8930

TL;DR: reduce (Mongo) is like the combiner, and finalize (Mongo) is almost like the reducer, except that it takes just one key/value pair. If you need all your data in the reduce (Hadoop) sense, aggregate it with reduce (Mongo) into one big array and pass that to finalize. Use some sort of flag in the output values to do so.

That's how I do it, and I think it would suck for big loads of data, but I don't know of any other way to do it with MongoDB's mapReduce :( (though I'm not very experienced with it)
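A rough Python sketch of that pattern (MongoDB's real reduce and finalize are JavaScript; the shapes below are only an illustration of the aggregate-into-an-array idea):

```python
# reduce only concatenates values into one array; finalize does the real work
# once per key, much like a Hadoop reducer would.

def reduce_collect(key, values):
    merged = []
    for v in values:
        # A value may be a raw emit or the array produced by an earlier
        # reduce call (re-reduce), so flatten accordingly.
        if isinstance(v, list):
            merged.extend(v)
        else:
            merged.append(v)
    return merged

def finalize(key, reduced_value):
    # Sees the full set of values for the key exactly once.
    return {"count": len(reduced_value), "total": sum(reduced_value)}

partial = reduce_collect("k", [1, 2, 3])
partial = reduce_collect("k", [partial, 4, 5])   # simulated re-reduce
print(finalize("k", partial))                    # {'count': 5, 'total': 15}
```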

Upvotes: 0

Praveen Sripati

Reputation: 33495

According to the Google MapReduce paper

When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

MongoDB document says

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

So, in the case of MapReduce as defined in the Google paper, the reduce starts processing the key/value pairs once all the data for a particular key has been transferred to the reducer. But, as Tomasz mentioned, MongoDB seems to implement MapReduce in a slightly different way.

In the MapReduce proposed by Google, either the Map or the Reduce tasks are processing the KV pairs at any given time, but in the MongoDB implementation the Map and Reduce tasks process the KV pairs simultaneously. The MongoDB approach might not be efficient, since the nodes are not used efficiently and there is a chance that the Map and Reduce slots in the cluster fill up and cannot run new jobs.

The catch in Hadoop is that although the reduce tasks don't process the KV pairs until the maps are done processing the data, the reduce tasks can be spawned before the mappers complete their processing. The parameter "mapreduce.job.reduce.slowstart.completedmaps" defaults to "0.05", and its description says "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."

Here you would need to move all values (with the same key) to the same machine to be summed. Moving data to the function seems to be the opposite of what map reduce is supposed to do.

Also, data locality is considered for the map tasks but not the reduce tasks: for the reduce tasks, the data has to be moved from the different mappers on different nodes to the reducers for aggregation.

Just my 2c.

Upvotes: 2

Tomasz Nurkiewicz

Reputation: 340708

This is how the original Map Reduce framework was described by Google:

2 Programming Model

[...]

The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

And later:

3 Implementation

[...]

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function.

So there is only one invocation of Reduce per key. The problem of moving a lot of small intermediate pairs is addressed by using a special combiner function locally:

4.3 Combiner Function

In some cases, there is significant repetition in the intermediate keys produced by each map task [...] We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.

The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. [...]

Partial combining significantly speeds up certain classes of MapReduce operations.
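A rough word-count style sketch in Python of how a combiner does that partial merging on the map side before anything crosses the network (the data and function names are invented for illustration):

```python
from collections import Counter

def map_words(text):
    # Emits ("word", 1) for every word, as in the classic word-count example.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Runs locally on the map node: partial merge before the shuffle.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_counts(pairs):
    # Final reduce over the (already much smaller) combined pairs.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

node1 = map_words("to be or not to be")
node2 = map_words("to map is to reduce")

shuffled = combine(node1) + combine(node2)   # only one pair per word per node moves
print(reduce_counts(shuffled))
```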

TL;DR

Wikipedia follows the original MapReduce design; MongoDB's designers took a slightly different approach.

Upvotes: 4
