MapReduce function to return two outputs. MongoDB

Question

I am currently doing some basic mapReduce using MongoDB.

I currently have data that looks like this:

db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...

I want to create a mapReduce function to group data by gender and show:

Total number of each gender, Male & Female
Average weight of each gender

I can create a map/reduce function to carry out each task separately, but I just can't get my head around how to show output for both. I am guessing that since the grouping is based on gender, the map function should stay the same and I just need to alter something in the reduce section...

Work so far

var map1 = function() {
  var key = this.gender;
  emit(key, { count: 1 });
};

var reduce1 = function(key, values) {
  var sum = 0;
  values.forEach(function(value) { sum += value["count"]; });
  return { count: sum };
};

db.football_team.mapReduce(map1, reduce1, {out: "gender_stats"});

Output

db.gender_stats.find()
{"_id" : "f", "value" : {"count": 12} }
{"_id" : "m", "value" : {"count": 18} }

Thanks

Neil Lunn · Accepted Answer

The key rule of "map/reduce" in any implementation is that the mapper must emit data in the same shape that the reducer returns. The reason is that "map/reduce" conceptually works by quite possibly calling the reducer multiple times. This means your reducer can be called on output that was already returned from a previous pass through the reducer, along with other data from the mapper.

MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
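To make the re-reduce behaviour concrete, here is a minimal sketch in plain JavaScript with no MongoDB involved. The function names `reduceTotals` and `reduceAverage` and the sample weights are made up for illustration. It compares a reducer that keeps a running total and count against one that averages directly, feeding each its own earlier output:

```javascript
// Plain JavaScript sketch; the values below are made-up sample data.
const emitted = [
  { count: 1, weight: 93 },
  { count: 1, weight: 80 },
  { count: 1, weight: 100 },
];

// Returns the same shape it receives, so its output can be reduced again.
function reduceTotals(key, values) {
  const output = { count: 0, weight: 0 };
  values.forEach(value => {
    output.count += value.count;
    output.weight += value.weight;
  });
  return output;
}

// Averages directly; it cannot tell how many documents a previous result covered.
function reduceAverage(key, values) {
  const sum = values.reduce((acc, value) => acc + value.weight, 0);
  return { weight: sum / values.length };
}

// One pass over everything, versus a partial result fed back in with the third value.
const onePass = reduceTotals("m", emitted);
const twoPass = reduceTotals("m", [reduceTotals("m", emitted.slice(0, 2)), emitted[2]]);

const avgOnePass = reduceAverage("m", emitted);
const avgTwoPass = reduceAverage("m", [reduceAverage("m", emitted.slice(0, 2)), emitted[2]]);

console.log(onePass, twoPass);       // both { count: 3, weight: 273 }
console.log(avgOnePass, avgTwoPass); // { weight: 91 } versus { weight: 93.25 }
```

The totals reducer gives the same answer however the calls are split up, while the direct average drifts as soon as a partial result is reduced again.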

That said, your best approach to an "average" is therefore to total the data along with a count, and then simply divide the two. The division adds another step to a "map/reduce" operation, in the form of a finalize function.

db.football_team.mapReduce(
  // mapper
  function() {
    emit(this.gender, { count: 1, weight: this.weight });
  },
  // reducer
  function(key,values) {
    var output = { count: 0, weight: 0 };

    values.forEach(value => {
      output.count += value.count;
      output.weight += value.weight;
    });

    return output;
  },
  // options and finalize
  {
    "out": "gender_stats",   // or { "inline": 1 } if you don't need another collection
    "finalize": function(key,value) {
      value.avg_weight = value.weight / value.count;  // take an average
      delete value.weight;                            // optionally remove the unwanted key

      return value;
    }
  }
)

This works because both mapper and reducer produce data with the same shape, and the reducer expects its input in that shape as well. The finalize function is invoked only after all "reducing" is done, and simply processes each result.
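As a rough illustration of where finalize sits in the flow, the sketch below imitates the three phases in memory with plain JavaScript. It is not how MongoDB executes the job internally. The names `emittedByKey`, `reduceFn` and `finalizeFn` are made up, and the third document is hypothetical sample data:

```javascript
// In-memory imitation of map, reduce and finalize; not MongoDB's implementation.
// The first two documents come from the question, the third is made up.
const docs = [
  { name: "Tane Shane", weight: 93, gender: "m" },
  { name: "Lily Jones", weight: 45, gender: "f" },
  { name: "Hypothetical Player", weight: 87, gender: "m" },
];

// Map phase: collect the emitted values under each key.
const emittedByKey = new Map();
for (const doc of docs) {
  if (!emittedByKey.has(doc.gender)) emittedByKey.set(doc.gender, []);
  emittedByKey.get(doc.gender).push({ count: 1, weight: doc.weight });
}

// Reduce phase: sums counts and weights, returning the emitted shape.
function reduceFn(key, values) {
  const output = { count: 0, weight: 0 };
  values.forEach(value => {
    output.count += value.count;
    output.weight += value.weight;
  });
  return output;
}

// Finalize phase: runs once per key on the fully reduced value.
function finalizeFn(key, value) {
  return { count: value.count, avg_weight: value.weight / value.count };
}

// MongoDB skips reduce for a key with a single value; calling it here anyway
// gives the same result because input and output shapes match.
const stats = [...emittedByKey].map(([key, values]) => ({
  _id: key,
  value: finalizeFn(key, reduceFn(key, values)),
}));

console.log(stats);
```

Because the average is only computed in the finalize step, the reducer never has to deal with a value that has already been divided.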

However, the aggregate() method does this far more efficiently, using natively coded methods which do not incur the overhead (and potential security risks) of server-side JavaScript interpretation and execution:

db.football_team.aggregate([
  { "$group": {
    "_id": "$gender",
    "count": { "$sum": 1 },
    "avg_weight": { "$avg": "$weight" }
  }}
])

And that's basically it. Moreover, you can continue and do other things after a $group pipeline stage (or any stage, for that matter) in ways that you cannot with MongoDB's mapReduce implementation. A notable example is applying a $sort to the results:

db.football_team.aggregate([
  { "$group": {
    "_id": "$gender",
    "count": { "$sum": 1 },
    "avg_weight": { "$avg": "$weight" }
  }},
  { "$sort": { "avg_weight": -1 } }
])

The only ordering mapReduce applies to its output is that the key used with emit is always sorted in ascending order. You cannot sort the aggregated result in any other way, short of querying the output collection afterwards or working "in memory" with the results returned from the server.
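For the "in memory" route, a minimal sketch in plain JavaScript might look like the following. The `results` array is an assumed stand-in for inline mapReduce output after finalize. The counts are taken from the question's output, but the `avg_weight` numbers are made up:

```javascript
// Assumed shape of inline results after finalize; avg_weight values are made up.
const results = [
  { _id: "f", value: { count: 12, avg_weight: 58.5 } },
  { _id: "m", value: { count: 18, avg_weight: 84.2 } },
];

// Copy before sorting so the original key-ordered array is left untouched.
const byAvgWeightDesc = [...results].sort(
  (a, b) => b.value.avg_weight - a.value.avg_weight
);

console.log(byAvgWeightDesc.map(r => r._id)); // [ 'm', 'f' ]
```

This is fine for a handful of groups, but it moves the sorting work to the client, which is exactly what the $sort stage in the pipeline above avoids.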

As a "side note" (though an important one), while you are learning you should also consider that the "server-side JavaScript" functionality of MongoDB is really more of a workaround than a feature. When MongoDB was first introduced, it used a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.

Adding a JavaScript engine was a "quick fix" that allowed certain things to be done with minimal implementation effort, making up for the many query operators and aggregation functions that would only come later.

The result over the years is that those JavaScript engine features are gradually being removed. The group() function of the API has been removed. The eval() function of the API is deprecated and scheduled for removal in the next major version. The writing is basically "on the wall" for these server-side JavaScript features: the clear pattern is that once native features provide support for something, the need to keep supporting the JavaScript engine goes away.

The core wisdom here is that focusing on learning these server-side JavaScript features is probably not worth the time invested, unless you have a pressing use case that currently cannot be solved by any other means.
