gremo

Reputation: 48899

MongoDB MapReduce: different results with the "same approach", what am I missing?

I know I'm missing something about MapReduce in MongoDB. I'm trying to build a tag-frequency collection, and I get different results from two versions even though their map and reduce functions seem to be the "same".

Example document (ignore the values 100, 45, ...; I'm not using them):

{
    ...
    tags: [['Rock', 100], ['Indie Pop', 45], ...]
}

Emitting a scalar value 1:

var map = function () {
    if (this.tags) {
        this.tags.forEach(function (tag) {
            emit(tag[0], 1); // Emit just 1
        });
    }
};

var reduce = function (key, vals) { // vals should be [1, ...]
    return vals.length; // Count the length of the array
};

db.tracks.mapReduce(map, reduce, { out: 'mapreduce_out' });
db.mapreduce_out.find().sort({ value: -1 }).limit(3);

Output is:

{ "_id" : "rubyrigby1", "value" : 9 }
{ "_id" : "Dom", "value" : 7 }
{ "_id" : "Feel Better", "value" : 7 }

Emitting an object { count: 1 }:

var map = function () {
    if (this.tags) {
        this.tags.forEach(function (tag) {
            emit(tag[0], { count: 1 }); // Emit an object
        });
    }
};

var reduce = function (key, vals) { // vals should be [{ count: 1 }, ...]
    var count = 0;

    vals.forEach(function (val) {
        count += val.count; // Accumulate the counts
    });

    return { count: count };
};

db.tracks.mapReduce(map, reduce, { out: 'mapreduce_out' });
db.mapreduce_out.find().sort({ 'value.count': -1 }).limit(3);

Result is different and appears to be "right":

{ "_id" : "rock", "value" : { "count" : 9472 } }
{ "_id" : "pop", "value" : { "count" : 7103 } }
{ "_id" : "electronic", "value" : { "count" : 5727 } }

What's wrong with the first approach?

Upvotes: 2

Views: 321

Answers (1)

A. Jesse Jiryu Davis

Reputation: 24009

Consider a collection of a thousand documents all with the tag 'tagname':

for (var i = 0; i < 1000; i++) {
    db.collection.insert({tags: [['tagname']]});
}

If I write a proper mapReduce I should get the output {"_id": "tagname", "value": 1000}. But if I use your map and reduce functions I'll get a value of 101 instead of 1000.

The reason is, MongoDB calls your reduce function repeatedly with intermediate results, in order to avoid keeping too large a batch of results in memory. You can actually see this by putting a print statement in your reduce:

var reduce = function (key, vals) {
    print(vals);
    return vals.length; // Count the length of the array
};

The print output appears in the server log. The reduce function is called with the first 100 1's, and it returns 100. So far so good. Then MongoDB calls it again with the first reduce's output plus the next 100 1's:

reduce('tagname', [100, 1, 1, ..., 1]) // 100 plus 100 more 1's

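So now it returns 101, because that's the length of the array. But clearly it should return 200, the sum of the array. In general, reduce must return a value of the same shape as the values that map emits, and it must give the right answer when its own earlier output is fed back in as one of the inputs. That is why your second version, which sums val.count, works. So to get a correct result from the first version, change your reduce function: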

var reduce = function (key, vals) {
    var sum = 0;
    // Sum the values: each one is either an emitted 1 or an earlier partial sum
    vals.forEach(function (val) { sum += val; });
    return sum;
};
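
To see the effect without a server, here is a rough sketch in plain JavaScript that imitates the re-reduce behavior described above. The helper simulateReduce and the fixed batch size of 100 are my own assumptions for illustration; MongoDB's real batching is internal and need not follow this exact pattern. The three reduce functions are the ones from this thread.

```javascript
// Hypothetical helper (not a MongoDB API): reduces the first batch, then
// feeds each partial result back in together with the next batch of values.
function simulateReduce(reduce, key, values, batchSize) {
    var result = reduce(key, values.slice(0, batchSize));
    for (var i = batchSize; i < values.length; i += batchSize) {
        var batch = [result].concat(values.slice(i, i + batchSize));
        result = reduce(key, batch);
    }
    return result;
}

// First approach from the question: counts array elements.
var countByLength = function (key, vals) {
    return vals.length;
};

// Corrected version: sums the values.
var countBySum = function (key, vals) {
    var sum = 0;
    vals.forEach(function (val) { sum += val; });
    return sum;
};

// Second approach from the question: sums the count fields.
var countByObject = function (key, vals) {
    var count = 0;
    vals.forEach(function (val) { count += val.count; });
    return { count: count };
};

// One emitted value per document, as in the 1000-document example.
var ones = [];
var objects = [];
for (var i = 0; i < 1000; i++) {
    ones.push(1);
    objects.push({ count: 1 });
}

var byLength = simulateReduce(countByLength, 'tagname', ones, 100);
var bySum = simulateReduce(countBySum, 'tagname', ones, 100);
var byObject = simulateReduce(countByObject, 'tagname', objects, 100);

console.log(byLength);       // 101
console.log(bySum);          // 1000
console.log(byObject.count); // 1000
```

Only the length-based reduce breaks, because it treats a partial result such as 100 as a single element instead of as 100 already-counted values.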

Upvotes: 4
