eran
eran

Reputation: 6921

Reduce is called several times with the same key in mongodb map-reduce

I'm trying to run map reduce on mongodb in mongo shell. For some reason, in the reduce phase, I get several calls for the same key (instead of single one), so I get wrong results. I'm not an expert in this domains, so maybe I'm doing some stupid mistake. Any help appreciated.

Thanks.

This is my small example:

I'm creating 10000 documents:

var i = 0;
db.docs.drop();
while (i < 10000) {
    db.docs.insert({text:"line " + i,index:i});
    i++;
}

Then I'm doing map-reduce based on module 10 (so I except to get 1000 in each "bucket")

db.docs.mapReduce(
    function() { 
       emit(this.index%10,1);
    },
    function(key,values) {
       return values.length;
    },
    {
    out : {inline : 1}
    }
);

However, as results I get the following:

{
    "results" : [
        {
            "_id" : 0,
            "value" : 21
        },
        {
            "_id" : 1,
            "value" : 21
        },
        {
            "_id" : 2,
            "value" : 21
        },
        {
            "_id" : 3,
            "value" : 21
        },
        {
            "_id" : 4,
            "value" : 21
        },
        {
            "_id" : 5,
            "value" : 21
        },
        {
            "_id" : 6,
            "value" : 21
        },
        {
            "_id" : 7,
            "value" : 21
        },
        {
            "_id" : 8,
            "value" : 21
        },
        {
            "_id" : 9,
            "value" : 21
        }
    ],
    "timeMillis" : 76,
    "counts" : {
        "input" : 10000,
        "emit" : 10000,
        "reduce" : 500,
        "output" : 10
    },
    "ok" : 1,
}

Upvotes: 5

Views: 1617

Answers (1)

mnemosyn
mnemosyn

Reputation: 46301

Map/Reduce is essentially a recursive operation. In particular, the documented requirements for the reduce function include the following statement:

MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

Therefore, you have to expect that the input is merely the number that was counted by a previous invocation. The following code does that by actually adding the values:

db.docs.mapReduce(
    function() { emit(this.index % 10, 1); }, 
    function(key,values) { return Array.sum(values); }, 
    { out : {inline : 1} } );

Now, the emit(key, 1) makes more sense in a way, because 1 is no longer just any number used to fill the array, but its value is considered.

As a sidenote, note how dangerous this is: For a smaller dataset, the correct result might have been given by accident, because the engine decided a parallelization wouldn't be necessary.

Upvotes: 6

Related Questions