jamesamuir

Reputation: 1457

MongoDB Map Reduce giving mixed results

Given this document format

{
    "_id" : ObjectId("55e99afda8deab702bb51001"),   
    "shippingStatus" : "",   
    "skuOwner" : ObjectId("55e99afd670a4c5b16e2a6ec")    
}

Here is a map reduce that I am trying to run

inventory_map = function() {
    var values = {
        inventory: this._id,       
        count: 1
    };
    emit(this.skuOwner, values);
};

reduce = function(key, values) {
  var result = {      
      "openCount": 0,
      "inventory": []     
    };

    values.forEach(function(value) {
      result.openCount += 1;
      if(value.inventory !== null) {result.inventory.push(value.inventory)}
    });

    return result;
}


res = db.inventories.mapReduce(inventory_map, reduce, {out: 'openInventory', query: {shippingStatus: {$ne: 'SHIPPED'}}});

Here are the results

(screenshot of the mapReduce output; image not preserved)

I would expect that every one of my documents would conform to the result object that I specified but this does not seem to be the case. Can someone explain to me why I am seeing this behavior?

Upvotes: 0

Views: 60

Answers (2)

Blakes Seven

Reputation: 50406

This is the same basic problem that comes up again and again with mapReduce. It is hard to mark these questions as "duplicate" because each implementation is different, but the underlying cause is always the same.

mapReduce is the wrong tool for this job anyway, but please read on to see how to make it work correctly.

When reading up on mapReduce you basically missed this vital piece of information:

MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

And also later:

the type of the return object must be identical to the type of the value emitted by the map function.

What that means, and what you are doing wrong here, is that your "mapper" emits completely different data from what your "reducer" returns. Because the reducer can take the "previous output from the reduce function" as input and "reduce again", this is where everything fails.

To clarify, the "reduce" is not "all or nothing", but rather an "incremental" approach where not all of the common key values are presented to the function "all at once". Instead only a small "sub-set" of the values are presented and the returned output can be "fed into reduce" yet again. This is basically how you deal with "big data" results, by processing in "chunks" rather than all at once.
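To see the "reduce again" problem concretely, here is a minimal plain-JavaScript sketch that runs outside MongoDB. It calls the reduce function from the question by hand, once over all values and once in two steps. The ids "a" to "d", the key "owner" and the split point are made up for illustration; how MongoDB actually chunks the values is not something you control.

```javascript
// The reduce function from the question, unchanged.
function originalReduce(key, values) {
    var result = { openCount: 0, inventory: [] };
    values.forEach(function (value) {
        result.openCount += 1;
        if (value.inventory !== null) { result.inventory.push(value.inventory); }
    });
    return result;
}

// Four values as the question's mapper would emit them for one key.
// The ids are placeholder strings standing in for ObjectIds.
var emitted = ["a", "b", "c", "d"].map(function (id) {
    return { inventory: id, count: 1 };
});

// Case 1: all values arrive in a single reduce call.
var single = originalReduce("owner", emitted);

// Case 2: three values are reduced first, then that output is fed
// back into reduce together with the remaining value.
var partial = originalReduce("owner", emitted.slice(0, 3));
var rereduced = originalReduce("owner", [partial, emitted[3]]);

console.log(JSON.stringify(single));
// {"openCount":4,"inventory":["a","b","c","d"]}
console.log(JSON.stringify(rereduced));
// {"openCount":2,"inventory":[["a","b","c"],"d"]}
```

In the second case the count is wrong and the array is nested, because the reducer treats its own earlier output as if it were a single mapped value.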

Fixing this is generally as simple as making the "mapper" produce the same "output" as the "reducer" expects for "input" and will itself produce as "output". So simple changes make all the difference here:

inventory_map = function() {
    var values = {
        inventory: [this._id],       
        openCount: 1                // all we changed on both
    };
    emit(this.skuOwner, values);
};

reduce = function(key, values) {
  var result = {      
      "openCount": 0,
      "inventory": []     
    };

    values.forEach(function(value) {
      result.openCount += value.openCount;           // and that too
      result.inventory = result.inventory.concat(value.inventory);      // that as well i guess
    });

    return result;
}

Now everything is the same from "output" of both "mapper" and "reducer" and the "reducer" also respects the same things as "input", so it works.

On the other hand, it really seems like you should be using .aggregate() instead. The operations here are very simple, and the aggregation framework generally runs a lot faster than mapReduce, since its operators are natively coded rather than interpreted JavaScript:

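db.inventories.aggregate([
    // same filter as the "query" option in the mapReduce call
    { "$match": { "shippingStatus": { "$ne": "SHIPPED" } } },
    { "$group": {
        "_id": "$skuOwner",
        "inventory": { "$push": "$_id" },
        "openCount": { "$sum": 1 }
    }}
])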

Much simpler, a lot faster, and more concise. Learn it well.

Upvotes: 2

Philipp

Reputation: 69663

An important requirement of MapReduce is that the output format of the map-function and the output-format of the reduce-function are identical. This is not the case in your code. Your map output format is:

{
    inventory: this._id,       
    count: 1
};

and your reduce output format is:

{      
    openCount: 0,
    inventory: []     
};

These formats must be identical because when map provides only one value for a key, that value MIGHT not be passed to reduce at all and instead go directly to the output. Also, any of the results from reduce MIGHT be put into another round of reduce together with previously unprocessed results (this usually only happens when processing very large datasets or when you process data from multiple shards).

Those results which still have a count field and where inventory is still a single value and not an array were never passed to your reduce function.
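The single-value case can be illustrated with a toy, in-memory stand-in for mapReduce. This is a rough sketch, not how MongoDB is implemented: simulateMapReduce is a hypothetical helper, the documents and ids are invented, and the mapper takes emit as a parameter here because there is no global emit outside the server.

```javascript
// Hypothetical helper: groups emitted values per key and only calls
// reduce when a key has more than one value.
function simulateMapReduce(docs, mapFn, reduceFn) {
    var groups = {};
    docs.forEach(function (doc) {
        mapFn.call(doc, function emit(key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });
    return Object.keys(groups).map(function (key) {
        var values = groups[key];
        return {
            _id: key,
            value: values.length === 1 ? values[0] : reduceFn(key, values)
        };
    });
}

// The question's mapper, adapted to receive emit as an argument.
function questionMap(emit) {
    emit(this.skuOwner, { inventory: this._id, count: 1 });
}

// The question's reducer, unchanged.
function questionReduce(key, values) {
    var result = { openCount: 0, inventory: [] };
    values.forEach(function (value) {
        result.openCount += 1;
        if (value.inventory !== null) { result.inventory.push(value.inventory); }
    });
    return result;
}

// Invented sample data: ownerA has two documents, ownerB has one.
var sampleDocs = [
    { _id: "i1", skuOwner: "ownerA" },
    { _id: "i2", skuOwner: "ownerA" },
    { _id: "i3", skuOwner: "ownerB" }
];

var results = simulateMapReduce(sampleDocs, questionMap, questionReduce);
console.log(JSON.stringify(results[0]));
// {"_id":"ownerA","value":{"openCount":2,"inventory":["i1","i2"]}}
console.log(JSON.stringify(results[1]));
// {"_id":"ownerB","value":{"inventory":"i3","count":1}}
```

The ownerB result keeps the mapper's shape, which is the kind of mixed output described in the question.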

To fix this issue, modify your map function to return results which look identical to the output of your reduce function:

inventory_map = function() {
    var value = {
        inventory: [ this._id ],       
        openCount: 1
    };
    emit(this.skuOwner, value);
};

and modify your reduce function accordingly:

reduce = function(key, values) {
  var result = {      
      "openCount": 0,
      "inventory": []     
    };

    values.forEach(function(value) {
      result.openCount += value.openCount;  // <--!!!
      if(value.inventory !== null) {
         result.inventory = result.inventory.concat(value.inventory); // <--!!!
      }
    });

    return result;
}

By the way: A simpler way to solve your issue might be an aggregation:

db.inventories.aggregate([
    { $match: {
        shippingStatus: { $ne: 'SHIPPED' }
    }},
    { $group: {
       _id: "$skuOwner",
       openCount: { $sum: 1 },
       inventory: { $push: "$_id" }   // collect the ids, like the reducer's array
    }},
    { $out: "openInventory" }
]);

Upvotes: -1
