user1275011

Reputation: 1772

MongoDB MapReduce strange results

When I perform a MapReduce operation over a MongoDB collection with a small number of documents, everything works fine.

But when I run it against a collection with about 140,000 documents, I get some strange results:

Map function:

function() { emit(this.featureType, this._id); }

Reduce function:

function(key, values) { return { count: values.length, ids: values }; }

As a result, I would expect something like (for each mapping key):

{
    "_id": "FEATURE_TYPE_A",
    "value": {
        "count": 140000,
        "ids": [
            "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
            "db364b3f-045f-4cb8-a52e-2267df40066c",
            "d2152826-6777-4cc0-b701-3028a5ea4395",
            "7ba366ae-264a-412e-b653-ce2fb7c10b52",
            "513e37b8-94d4-4eb9-b414-6e45f6e39bb5",
            ...
        ]
    }
}

But instead I get this strange document structure:

{
    "_id": "FEATURE_TYPE_A",
    "value": {
        "count": 706,
        "ids": [
            {
                "count": 101,
                "ids": [
                    {
                        "count": 100,
                        "ids": [
                            "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
                            "db364b3f-045f-4cb8-a52e-2267df40066c",
                            "d2152826-6777-4cc0-b701-3028a5ea4395",
                            "7ba366ae-264a-412e-b653-ce2fb7c10b52",
                            "513e37b8-94d4-4eb9-b414-6e45f6e39bb5",
                            ...
                        ]
                    },
                    ...
                ]
            },
            ...
        ]
    }
}

Could someone explain to me whether this is the expected behavior, or what I am doing wrong?

Thanks in advance!

Upvotes: 3

Views: 1719

Answers (1)

Neil Lunn

Reputation: 151190

The case here is unusual, and I'm not sure this is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.

  • MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

What that basically says is that your current operation expects the "reduce" function to be called only once, but this is not the case. The input will in fact be "broken up" and passed in as chunks of manageable size. This multiple calling of "reduce" makes another point very important.
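The nesting in the question can be reproduced entirely outside MongoDB. The snippet below is a plain JavaScript simulation (the id values and chunk sizes are made up for illustration) of what happens when mongod invokes the original reduce function a second time with its own previous output among the inputs:

```javascript
// The original reduce function from the question (with its brace closed).
function reduce(key, values) {
  return { count: values.length, ids: values };
}

// First pass: mongod reduces one chunk of emitted values.
var chunk = reduce("FEATURE_TYPE_A", ["id1", "id2", "id3"]);
// chunk is { count: 3, ids: ["id1", "id2", "id3"] }

// Later pass: that previous output becomes ONE of the input values,
// alongside newly emitted ids, producing the nested structure above.
var rereduced = reduce("FEATURE_TYPE_A", [chunk, "id4", "id5"]);
// rereduced.ids[0] is the entire previous { count, ids } document.
```

The count goes wrong for the same reason: `values.length` counts the previous reduce output as a single value rather than the number of documents it represents.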

Because it is possible to invoke the reduce function more than once for the same key, the following property needs to be true:

  • the type of the return object must be identical to the type of the value emitted by the map function.

Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result: the "mapper" must emit its output in the same form in which it will appear in the "reducer", and the reduce process itself must be mindful of this.

So first the mapper revised:

function () { emit(this.featureType, { count: 1, ids: [this._id] }); }

Which is now consistent with the final output form. This is important when considering the reducer which you now know will be invoked multiple times:

function (key, values) {

  var ids = [];
  var count = 0;

  // Each incoming value is in { count: n, ids: [...] } form, whether it
  // came straight from the mapper or from a previous reduce pass.
  values.forEach(function(value) {
    count += value.count;
    value.ids.forEach(function(id) {
      ids.push( id );
    });
  });

  return { count: count, ids: ids };

}
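For completeness, a hypothetical invocation of the revised pair from the mongo shell might look like this (the collection name "features" and the output collection name "feature_ids" are assumptions for illustration):

db.features.mapReduce(
    function () { emit(this.featureType, { count: 1, ids: [this._id] }); },
    function (key, values) {
        var ids = [];
        var count = 0;
        values.forEach(function(value) {
            count += value.count;
            value.ids.forEach(function(id) { ids.push( id ); });
        });
        return { count: count, ids: ids };
    },
    { out: "feature_ids" }
)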

What this means is that each invocation of the reduce function expects the same form of input as it outputs: a count field and an array of ids. It gets to the final result by essentially:

  • Reducing one chunk of results, #chunk1
  • Reducing another chunk of results, #chunk2
  • Combining the reduce on the reduced chunks, #chunk1 and #chunk2

That may not seem immediately apparent, but the behavior is by design: the reducer is called many times in this way to process large sets of emitted data, so it "aggregates" gradually rather than in one big step.
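The three steps above can be checked outside MongoDB with a plain JavaScript sketch (the emitted values are made up for illustration), showing that the revised reducer produces the same result whether it runs once over everything or over previously reduced chunks:

```javascript
// The revised reducer from above, unchanged.
function reduce(key, values) {
  var ids = [];
  var count = 0;
  values.forEach(function(value) {
    count += value.count;
    value.ids.forEach(function(id) { ids.push(id); });
  });
  return { count: count, ids: ids };
}

// What emit(this.featureType, { count: 1, ids: [this._id] }) would
// produce for four documents of the same featureType:
var emitted = [
  { count: 1, ids: ["a"] }, { count: 1, ids: ["b"] },
  { count: 1, ids: ["c"] }, { count: 1, ids: ["d"] }
];

// Path 1: a single reduce over all emitted values.
var oneShot = reduce("FEATURE_TYPE_A", emitted);

// Path 2: reduce two chunks, then reduce the reduced chunks.
var chunk1 = reduce("FEATURE_TYPE_A", emitted.slice(0, 2));
var chunk2 = reduce("FEATURE_TYPE_A", emitted.slice(2));
var rereduced = reduce("FEATURE_TYPE_A", [chunk1, chunk2]);

// Both paths agree: { count: 4, ids: ["a", "b", "c", "d"] }
```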


The aggregation framework makes this a lot more straightforward. From MongoDB 2.6 onwards the results can even be output to a collection, so if you had more than one result and the combined output was greater than 16MB, this would not be a problem:

db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 },
        "ids": { "$push": "$_id" }
    }},
    { "$out": "outputCollection" }
])

So that will not break and will return as expected, with greatly reduced complexity, as the operation is very straightforward.

But as already said, your purpose in returning the array of "_id" values here seems unclear given its sheer size. So if all you really want is a count by "featureType", you would use basically the same approach rather than trying to force mapReduce to find the length of a very large array:

db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 }
    }}
])

In either form the results will be correct, and will be produced in a fraction of the time that the mapReduce operation as constructed will take.

Upvotes: 7
