Reputation: 2112
I'm trying to do a simple map reduce in the Mongo shell, but the reduce function never gets called. This is my code:
db.sellers.mapReduce(
function(){ emit( this._id, 'Map') } ,
function(k,vs){ return 'Reduce' },
{ out: { inline: 1}})
And the result is
{
"results" : [
{
"_id" : ObjectId("4da0bdb56bd728c276911e1a"),
"value" : "Map"
},
{
"_id" : ObjectId("4da0df9a6bd728c276911e1b"),
"value" : "Map"
}
],
"timeMillis" : 0,
"counts" : {
"input" : 2,
"emit" : 2,
"output" : 2
},
"ok" : 1,
}
What's wrong?
I'm using MongoDB 1.8.1 32 bit on Ubuntu 10.10
Upvotes: 8
Views: 3880
Reputation: 882
It should also be mentioned that, according to the documentation, "MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key."
Also, reduce should be associative, commutative and idempotent:
reduce(key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [ C, A, B ] )
reduce( key, [ reduce(key, valuesArray) ] ) == reduce( key, valuesArray )
reduce( key, [ A, B ] ) == reduce( key, [ B, A ] )
This means that the reduce function must be ready to receive, as one of its input values, the result of a previous invocation of itself. For me, personally, the best way to implement mapReduce is therefore to make the map function emit values, where possible, in the same format that the reduce function returns. The reduce function then has to support only one input format. As a result, even if map emits only one value for a key (so the invocation of reduce is skipped), the value for that key in the final mapReduce result will still be in the same format as the values for the rest of the keys.
For instance, if we have the following document structure:
{
"foo": <some_string>,
"status": ("foo"|"bar")
}
the map function may be as follows:
function() {
var value = {
"num_total": 1,
"num_foos": 0,
"num_bars": 0
};
if (this.status == "foo") {
value["num_foos"] += 1;
}
if (this.status == "bar") {
value["num_bars"] += 1;
}
emit(this.foo, value);
}
and the reduce function will be:
function(key, values) {
var reduced = {
"num_total": 0,
"num_foos": 0,
"num_bars": 0
};
values.forEach(function(val) {
reduced["num_total"] += val["num_total"];
reduced["num_foos"] += val["num_foos"];
reduced["num_bars"] += val["num_bars"];
});
return reduced;
}
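The three properties listed earlier can be checked for this reduce function outside MongoDB. The following is a minimal sketch in plain JavaScript (for example, run with Node.js). The name `reduceCounts` and the sample values `A`, `B` and `C` are made up for illustration; the function body is the reduce function above, written with dot notation.

```javascript
// The reduce function from this answer, given a name so it can be called directly.
function reduceCounts(key, values) {
    var reduced = { num_total: 0, num_foos: 0, num_bars: 0 };
    values.forEach(function (val) {
        reduced.num_total += val.num_total;
        reduced.num_foos += val.num_foos;
        reduced.num_bars += val.num_bars;
    });
    return reduced;
}

// Hypothetical values, shaped like what the map function emits.
var A = { num_total: 1, num_foos: 1, num_bars: 0 };
var B = { num_total: 1, num_foos: 0, num_bars: 1 };
var C = { num_total: 1, num_foos: 1, num_bars: 0 };

// Reference result: everything reduced in one call.
var allAtOnce = reduceCounts("k", [C, A, B]);

// Re-reduce: a previous reduce output is fed back in as an input value.
var reReduced = reduceCounts("k", [C, reduceCounts("k", [A, B])]);

// Idempotence: reducing an already reduced value changes nothing.
var reducedTwice = reduceCounts("k", [reduceCounts("k", [C, A, B])]);

// Commutativity: the order of the input values does not matter.
var swapped = reduceCounts("k", [B, A, C]);

var expected = JSON.stringify(allAtOnce);
console.log(expected);
console.log(JSON.stringify(reReduced) === expected);
console.log(JSON.stringify(reducedTwice) === expected);
console.log(JSON.stringify(swapped) === expected);
```

All three comparisons hold because the emitted values and the reduced value share one shape, so the function never needs to know whether an input came from map or from an earlier reduce.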
Upvotes: 1
Reputation: 49172
Map reduce will collect values with a common key into a single value.
In this case nothing needs to be done, because each value emitted by map has a different key. No reduction is needed.
db.sellers.mapReduce(
function(){ emit( this._id, 'Map') } ,
function(k,vs){ return 'Reduce' },
{ out: { inline: 1}})
This is not entirely clear from reading the documentation.
If you wanted to call reduce, you might hardcode an ID like this:
db.sellers.mapReduce(
function(){ emit( 1, 'Map') } ,
function(k,vs){ return 'Reduce' },
{ out: { inline: 1}})
Now all the values emitted by map will be reduced until only one remains.
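To see the difference between the two calls without a database, here is a rough model in plain JavaScript. It is only a sketch of the behaviour described in this thread, not MongoDB's actual implementation: `simulateMapReduce` is a made-up helper, the `sellers` array is hypothetical data, and keys are converted to strings for simplicity.

```javascript
// Set by simulateMapReduce so that map functions can call emit() as in the shell.
var emit;

// Simplified model: group emitted values by key, call reduce only when a key
// has more than one value, otherwise pass the single value through untouched.
function simulateMapReduce(docs, map, reduce) {
    var groups = {}; // stringified key -> array of emitted values
    emit = function (key, value) {
        var k = String(key);
        (groups[k] = groups[k] || []).push(value);
    };
    docs.forEach(function (doc) { map.call(doc); }); // "this" is the document
    return Object.keys(groups).map(function (k) {
        var values = groups[k];
        return { _id: k, value: values.length > 1 ? reduce(k, values) : values[0] };
    });
}

var sellers = [{ _id: "a" }, { _id: "b" }]; // hypothetical documents

// One distinct key per document: reduce is never called.
var perId = simulateMapReduce(
    sellers,
    function () { emit(this._id, 'Map'); },
    function (k, vs) { return 'Reduce'; });

// One shared key: both values land in the same group and are reduced.
var shared = simulateMapReduce(
    sellers,
    function () { emit(1, 'Map'); },
    function (k, vs) { return 'Reduce'; });

console.log(JSON.stringify(perId));  // every value is still "Map"
console.log(JSON.stringify(shared)); // a single entry whose value is "Reduce"
```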
Upvotes: 1
Reputation: 855
Well, MongoDB does not call the reduce function for a key if there is only one value for it.
In my opinion, this is bad. It should be left to my reducer code to decide whether to skip a single value or do some operation on it.
Now, if I have to do some operation on a single value, I end up writing a finalize function, and in finalize I try to work out which values have gone through the reducer and which have not.
I am very sure it does not happen this way in Hadoop.
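One way to avoid telling the two cases apart in finalize is to have map emit values in the same shape that reduce returns, so finalize can treat every value identically. The sketch below is plain JavaScript and rests on assumptions: the `count`/`total` fields and the function names are invented for illustration, and it relies on finalize receiving either the reduced value or the single emitted value for each key.

```javascript
// Hypothetical reduce: sums values shaped like { count, total }.
function reduceTotals(key, values) {
    var out = { count: 0, total: 0 };
    values.forEach(function (v) {
        out.count += v.count;
        out.total += v.total;
    });
    return out;
}

// Hypothetical finalize: works whether or not reduce ran, because a single
// emitted value already has the same shape as a reduced one.
function finalizeAverage(key, value) {
    value.avg = value.total / value.count;
    return value;
}

// Key "a" had one emitted value, so reduce was skipped.
var single = finalizeAverage("a", { count: 1, total: 10 });

// Key "b" had two emitted values, so reduce ran first.
var many = finalizeAverage("b", reduceTotals("b", [
    { count: 1, total: 10 },
    { count: 1, total: 30 }
]));

console.log(single.avg); // 10
console.log(many.avg);   // 20
```

In the shell, functions like these would be passed as the reduce argument and as the `finalize` field of the options object given to mapReduce.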
Upvotes: 6
Reputation: 340743
The purpose of reduce is to, ahem, reduce the set of values associated with a given key into one value (aggregate results). If you emit only one value for each MapReduce key, there is no need for reduce; all the work is done. But if you emit two pairs for a given _id, reduce will be called:
emit(this._id, 'Map1');
emit(this._id, 'Map2');
This will call reduce with the following parameters:
reduce(_id, ['Map1', 'Map2'])
More likely, you will want to use _id as the MapReduce key when filtering a dataset: emit only when a given record fulfills some condition. But again, reduce won't be called in this case, which is expected.
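Both cases can be traced with a small stand-in written in plain JavaScript. This is a rough sketch under stated assumptions rather than MongoDB itself: `run`, `reduceCalls`, the sample `docs` and the `active` field are all hypothetical, and the stand-in simply calls reduce for every key that received more than one value.

```javascript
var reduceCalls = []; // arguments of every reduce call in the current run
var emitted = {};     // key -> values emitted so far in the current run

function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

// Two emits per document, as in the example above.
function mapTwice() {
    emit(this._id, 'Map1');
    emit(this._id, 'Map2');
}

// Filtering map: emits at most once per document ("active" is a made-up field).
function mapFiltered() {
    if (this.active) {
        emit(this._id, 'Map');
    }
}

function reduce(key, values) {
    reduceCalls.push([key, values]);
    return 'Reduce';
}

// Stand-in runner: reduce is invoked only for keys with more than one value.
function run(docs, map) {
    emitted = {};
    reduceCalls = [];
    docs.forEach(function (d) { map.call(d); });
    Object.keys(emitted).forEach(function (k) {
        if (emitted[k].length > 1) {
            reduce(k, emitted[k]);
        }
    });
    return reduceCalls;
}

var docs = [{ _id: "x", active: true }, { _id: "y", active: false }];

var twiceCalls = run(docs, mapTwice);       // one reduce call per _id
var filteredCalls = run(docs, mapFiltered); // no reduce calls at all

console.log(JSON.stringify(twiceCalls));
console.log(filteredCalls.length); // 0
```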
Upvotes: 18