Reputation: 6921
I'm trying to run map-reduce on MongoDB in the mongo shell. For some reason, in the reduce phase, I get several calls for the same key (instead of a single one), so I get wrong results. I'm not an expert in this domain, so maybe I'm making some stupid mistake. Any help appreciated.
Thanks.
This is my small example:
I'm creating 10000 documents:
var i = 0;
db.docs.drop();
while (i < 10000) {
    db.docs.insert({text: "line " + i, index: i});
    i++;
}
Then I'm doing a map-reduce based on modulo 10 (so I expect to get 1000 in each "bucket"):
db.docs.mapReduce(
    function() {
        emit(this.index % 10, 1);
    },
    function(key, values) {
        return values.length;
    },
    {
        out: {inline: 1}
    }
);
However, I get the following results:
{
    "results" : [
        {
            "_id" : 0,
            "value" : 21
        },
        {
            "_id" : 1,
            "value" : 21
        },
        {
            "_id" : 2,
            "value" : 21
        },
        {
            "_id" : 3,
            "value" : 21
        },
        {
            "_id" : 4,
            "value" : 21
        },
        {
            "_id" : 5,
            "value" : 21
        },
        {
            "_id" : 6,
            "value" : 21
        },
        {
            "_id" : 7,
            "value" : 21
        },
        {
            "_id" : 8,
            "value" : 21
        },
        {
            "_id" : 9,
            "value" : 21
        }
    ],
    "timeMillis" : 76,
    "counts" : {
        "input" : 10000,
        "emit" : 10000,
        "reduce" : 500,
        "output" : 10
    },
    "ok" : 1
}
Upvotes: 5
Views: 1617
Reputation: 46301
Map/Reduce is essentially a recursive operation. In particular, the documented requirements for the reduce function include the following statement:

MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
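To see why counting with values.length breaks under re-invocation, here is a small sketch in plain JavaScript (the batch sizes are made up for illustration; the real batching is up to the engine):

var reduceFn = function(key, values) { return values.length; };

// First invocation, over one batch of emitted 1s:
var partial = reduceFn(0, [1, 1, 1, 1]);   // 4 -- correct so far

// Later invocation: the previous output becomes ONE of the input values.
// values.length counts it as a single element, losing the 4 it stands for.
reduceFn(0, [partial, 1, 1]);              // returns 3 instead of 6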
Therefore, you have to expect that an input value may already be a count produced by a previous invocation, rather than a raw 1 from the map phase. The following code handles that by actually adding the values:
db.docs.mapReduce(
    function() { emit(this.index % 10, 1); },
    function(key, values) { return Array.sum(values); },
    { out: {inline: 1} }
);
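With summation, re-reducing a previous partial result preserves the total, which is exactly the property MongoDB requires. The same sketch as above, with Array.sum (a mongo shell helper) spelled out as its plain-JavaScript equivalent:

var reduceFn = function(key, values) {
    // Equivalent to the mongo shell's Array.sum(values):
    return values.reduce(function(a, b) { return a + b; }, 0);
};

var partial = reduceFn(0, [1, 1, 1, 1]);   // 4
reduceFn(0, [partial, 1, 1]);              // 6 -- the partial count survives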
Now the emit(key, 1) makes more sense as well: the 1 is no longer just an arbitrary number used to fill the array, but a value that actually contributes to the sum.
As a side note, note how dangerous this is: for a smaller dataset, the broken reduce function might have produced the correct result by accident, simply because the engine decided that splitting the work across multiple reduce invocations wasn't necessary.
Upvotes: 6