bauz

Reputation: 125

MongoDB - MapReduce finalize creates NaN values

I created a MapReduce Job for the following data structure:

{ "_id" : 1, "docid" : 119428, "term" : 5068, "score" : 0.198 }
{ "_id" : 2, "docid" : 154690, "term" : 5068, "score" : 0.21 }
{ "_id" : 3, "docid" : 156278, "term" : 5068, "score" : 0.128 }

{ "_id" : 4, "docid" : 700, "term" : "fire", "score" : 0.058 }
{ "_id" : 5, "docid" : 857, "term" : "fire", "score" : 0.133 }
{ "_id" : 6, "docid" : 900, "term" : "fire", "score" : 0.191 }
{ "_id" : 7, "docid" : 902, "term" : "fire", "score" : 0.047 }

I want to group by the term and then calculate the average score.

This is my MapReduce function:

db.keywords.mapReduce( 
  function(){ 
      emit( this.term, this.score ); 
  }, 
  function(key, values) { 
      rv = { cnt : 0, scoresum : 0}; 
      rv.cnt = values.length; rv.scoresum = Array.sum(values); 
      return rv; 
  },  
  { 
     out: "mr_test" , 
     finalize: function(key, reduceVal) { 
        reduceVal.avg = reduceVal.scoresum / reduceVal.cnt; 
        return reduceVal;  
     } 
   } 
)

Some calculated values are correct:

{ "_id" : 5068, "value" : { "cnt" : 5, "scoresum" : 0.887, "avg" : 0.1774 } }

but others are creating some strange structure:

{ "_id" : "fire", "value" : { "cnt" : 333, "scoresum" : "[object BSON][object BSON]0.176[object BSON]0.1010.181[object BSON][object .....BSON][object BSON][object BSON]0.1910.1710.2010.363[object BSON][object BSON]", "avg" : NaN } }

What is wrong with my MapReduce function?

Upvotes: 1

Views: 1365

Answers (1)

Blakes Seven

Reputation: 50426

You have missed the basic rule of processing mapReduce operations from the documentation:

MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

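This means that both the mapper and the reducer must emit exactly the same structure, and the reducer must expect that same structure as its input. If the reduce function outputs a different structure, then the next time that output is passed back into reduce, the input is no longer what the function expects. In your case the mapper emits a bare number while the reducer returns an object, so a later reduce call ends up adding objects to numbers, which produces the concatenated string and the NaN average.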
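The failure can be reproduced outside MongoDB. The following is a minimal sketch in plain JavaScript, not the server's actual code path: `sumLikeShell` is a hypothetical stand-in that assumes the shell's `Array.sum` simply adds elements with `+`, and the inputs are hand-picked from the sample data. It feeds one reduce result back into a second reduce call, as the documentation says the server may do.

```javascript
// Hypothetical stand-in for the shell's Array.sum (assumed to add with "+").
function sumLikeShell(values) {
    var total = values[0];
    for (var i = 1; i < values.length; i++) {
        total += values[i];
    }
    return total;
}

// The reducer from the question: it receives numbers but returns an object.
function brokenReduce(key, values) {
    return { cnt: values.length, scoresum: sumLikeShell(values) };
}

// First call: only raw scores arrive, so the result looks fine.
var partial = brokenReduce("fire", [0.058, 0.133]);

// Second call: the previous result arrives mixed in with raw scores.
var rereduced = brokenReduce("fire", [partial, 0.191, 0.047]);

// object + number falls back to string concatenation,
// and a non-numeric string divided by a number is NaN.
var avg = rereduced.scoresum / rereduced.cnt;
console.log(rereduced.scoresum, avg);
```

In plain JavaScript the object prints as `[object Object]` rather than `[object BSON]`, but the mechanism is the same. Note that `cnt` is also wrong after the second call: it counts the partial result as one value instead of two.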

This is how mapReduce handles large amounts of data for the same key: it reduces gradually, over and over, until only a single result remains for that key. With a consistent structure, the job becomes:

db.keywords.mapReduce(
  function(){
      // Emit the same shape that the reducer returns
      emit( this.term, { "cnt": 1, "score": this.score } );
  },
  function(key, values) {
      // values may contain earlier reduce results, so add up their counts
      var rv = { "cnt" : 0, "score" : 0 };
      values.forEach(function(value) {
          rv.cnt += value.cnt;
          rv.score += value.score;
      });
      return rv;
  },
  {
      "out": "mr_test" ,
      "finalize": function(key, reduceVal) {
          // Runs once per key, after all reducing is finished
          reduceVal.avg = reduceVal.score / reduceVal.cnt;
          return reduceVal;
      }
   }
)
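A quick way to convince yourself that a reducer is safe to re-run is to check it locally. This is only a sketch in plain JavaScript, with a hypothetical `reduceScores` function that copies the reducer logic and inputs taken from the four "fire" documents in the question. It compares reducing everything in one call against reducing in two stages.

```javascript
// Same logic as the reducer: input and output share one shape.
function reduceScores(key, values) {
    var rv = { cnt: 0, score: 0 };
    values.forEach(function (value) {
        rv.cnt += value.cnt;
        rv.score += value.score;
    });
    return rv;
}

// What the mapper would emit for the four "fire" documents.
var emitted = [0.058, 0.133, 0.191, 0.047].map(function (score) {
    return { cnt: 1, score: score };
});

// Reduce everything in one call...
var oneShot = reduceScores("fire", emitted);

// ...or reduce the first two, then feed that result back in with the rest.
var partial = reduceScores("fire", emitted.slice(0, 2));
var staged = reduceScores("fire", [partial].concat(emitted.slice(2)));

// Both routes should agree on the count and on the average.
console.log(oneShot.cnt, staged.cnt);
console.log(oneShot.score / oneShot.cnt, staged.score / staged.cnt);
```

If the two routes disagree, the reducer is making an assumption about its input that does not hold once results are fed back in.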

But actually the whole thing is much more efficiently done with the .aggregate() method:

db.keywords.aggregate([
    { "$group": {
        "_id": "$term",
        "cnt": { "$sum": 1 },
        "score": { "$sum": "$score" },
        "avg": { "$avg": "$score" }
    }},
    { "$out": "aggtest" }
])

This even uses the $avg accumulator, which gives you the average in a single pass.

Unlike mapReduce, the operators used here execute in native code as opposed to interpreted JavaScript. The result is much faster, and with fewer passes through the data.

In fact, $group makes only one pass over the data. The $out stage is optional and simply writes the result to a collection; without it, .aggregate() returns a cursor by default, which is yet another advantage over mapReduce.
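To make the "one pass" point concrete, here is a rough sketch in plain JavaScript of the bookkeeping that a $group with $sum and $avg amounts to. It is illustrative only and says nothing about how the server is implemented; `groupByTerm` and the `sample` array are hypothetical.

```javascript
// Hypothetical helper: one loop over the documents, one running total per term.
function groupByTerm(docs) {
    var groups = new Map(); // a Map keeps numeric and string terms distinct
    docs.forEach(function (doc) {
        if (!groups.has(doc.term)) {
            groups.set(doc.term, { cnt: 0, score: 0 });
        }
        var g = groups.get(doc.term);
        g.cnt += 1;
        g.score += doc.score;
    });
    // Derive the average once the single pass is complete.
    var results = [];
    groups.forEach(function (g, term) {
        results.push({ _id: term, cnt: g.cnt, score: g.score, avg: g.score / g.cnt });
    });
    return results;
}

var sample = [
    { docid: 700, term: "fire", score: 0.058 },
    { docid: 857, term: "fire", score: 0.133 },
    { docid: 119428, term: 5068, score: 0.198 }
];
var grouped = groupByTerm(sample);
console.log(grouped);
```

Because each document is visited once and only running totals are kept, there is no intermediate value that has to be fed back into the same function, which is the step that went wrong in the mapReduce version.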

Upvotes: 8
