Group and count over a start and end range

Question

If I have data in the following format:

[
  {
    _id: 1,
    startDate: ISODate("2017-01-01T00:00:00.000Z"),
    endDate: ISODate("2017-02-25T00:00:00.000Z"),
    type: 'CAR'
  },
  {
    _id: 2,
    startDate: ISODate("2017-02-17T00:00:00.000Z"),
    endDate: ISODate("2017-03-22T00:00:00.000Z"),
    type: 'HGV'
  }
]

Is it possible to retrieve data grouped by 'type', but also with a count of each type for each month in a given date range? For example, a range of 2017/1/1 to 2017/4/1 would return:

[
  {
   _id: 'CAR', 
   monthCounts: [
     /*January*/
     {
       from: ISODate("2017-01-01T00:00:00.000Z"),
       to: ISODate("2017-01-31T23:59:59.999Z"), 
       count: 1
     },
     /*February*/
     {
       from: ISODate("2017-02-01T00:00:00.000Z"),
       to: ISODate("2017-02-28T23:59:59.999Z"), 
       count: 1
     },
     /*March*/
     {
       from: ISODate("2017-03-01T00:00:00.000Z"),
       to: ISODate("2017-03-31T23:59:59.999Z"), 
       count: 0
     },
   ]
  },
  {
   _id: 'HGV', 
   monthCounts: [
     {
       from: ISODate("2017-01-01T00:00:00.000Z"),
       to: ISODate("2017-01-31T23:59:59.999Z"), 
       count: 0
     },
     {
       from: ISODate("2017-02-01T00:00:00.000Z"),
       to: ISODate("2017-02-28T23:59:59.999Z"), 
       count: 1
     },
     {
       from: ISODate("2017-03-01T00:00:00.000Z"),
       to: ISODate("2017-03-31T23:59:59.999Z"), 
       count: 1
     },
   ]
  }
]

The returned format is not really important. What I am trying to achieve is to retrieve, in a single query, a number of counts for the same grouping (one per month). The input could simply be a start and end date to report from or, more likely, an array of the date ranges to group by.

Neil Lunn · Accepted Answer

The algorithm for this is basically to "iterate" values over the interval between the two dates. MongoDB has a couple of ways to deal with this: mapReduce(), which has always been present, and the newer features available to the aggregate() method.

I'm going to expand on your selection to deliberately show an overlapping month, since your examples did not have one. This will result in the "HGV" values appearing in "three" months of output.

{
        "_id" : 1,
        "startDate" : ISODate("2017-01-01T00:00:00Z"),
        "endDate" : ISODate("2017-02-25T00:00:00Z"),
        "type" : "CAR"
}
{
        "_id" : 2,
        "startDate" : ISODate("2017-02-17T00:00:00Z"),
        "endDate" : ISODate("2017-03-22T00:00:00Z"),
        "type" : "HGV"
}
{
        "_id" : 3,
        "startDate" : ISODate("2017-02-17T00:00:00Z"),
        "endDate" : ISODate("2017-04-22T00:00:00Z"),
        "type" : "HGV"
}

Aggregate - Requires MongoDB 3.4

db.cars.aggregate([
  { "$addFields": {
    "range": {
      "$reduce": {
        "input": { "$map": {
          "input": { "$range": [ 
            { "$trunc": { 
              "$divide": [ 
                { "$subtract": [ "$startDate", new Date(0) ] },
                1000
              ]
            }},
            { "$trunc": {
              "$divide": [
                { "$subtract": [ "$endDate", new Date(0) ] },
                1000
              ]
            }},
            60 * 60 * 24
          ]},
          "as": "el",
          "in": {
            "$let": {
              "vars": {
                "date": {
                  "$add": [ 
                    { "$multiply": [ "$$el", 1000 ] },
                    new Date(0)
                  ]
                }
              },
              "in": {
                "$add": [
                  { "$multiply": [ { "$year": "$$date" }, 100 ] },
                  { "$month": "$$date" }
                ]
              }
            }
          }
        }},
        "initialValue": [],
        "in": {
          "$cond": {
            "if": { "$in": [ "$$this", "$$value" ] },
            "then": "$$value",
            "else": { "$concatArrays": [ "$$value", ["$$this"] ] }
          }
        }
      }
    }
  }},
  { "$unwind": "$range" },
  { "$group": {
    "_id": {
      "type": "$type",
      "month": "$range"
    },
    "count": { "$sum": 1 }
  }},
  { "$sort": { "_id": 1 } },
  { "$group": {
    "_id": "$_id.type",
    "monthCounts": { 
      "$push": { "month": "$_id.month", "count": "$count" }
    }
  }}
])

The key to making this work is the $range operator, which takes values for a "start" and an "end" as well as an "interval" to apply. The result is an array of values taken from the "start" and incremented by the interval up to, but not including, the "end".

We use this with startDate and endDate to generate the possible dates in between those values. Some math is needed here because $range only accepts 32-bit integers: subtracting new Date(0) from a date gives milliseconds since the epoch, which is too large, so the pipeline divides by 1000 and truncates to work in whole seconds instead.

Because we want "months", the operations applied extract the month and year values from the generated range. We actually generate the range as the "days" in between since "months" are difficult to deal with in math. The subsequent $reduce operation takes only the "distinct months" from the date range.
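To make the day-stepping and "distinct months" idea concrete, here is a minimal sketch of the same logic in plain JavaScript, with no MongoDB involved. The function name distinctMonths is an assumption made for illustration only; the steps mirror the $range, $map and $reduce expressions in the pipeline above.

```javascript
// Plain JavaScript sketch of the first pipeline stage.
// distinctMonths is a hypothetical helper name, not part of any MongoDB API.
function distinctMonths(startDate, endDate) {
  const DAY = 60 * 60 * 24; // step size in seconds, as passed to $range

  // Equivalent of $trunc / $divide / $subtract: whole seconds since the epoch
  const start = Math.trunc(startDate.getTime() / 1000);
  const end = Math.trunc(endDate.getTime() / 1000);

  const months = [];
  // Like $range, the end value itself is excluded
  for (let s = start; s < end; s += DAY) {
    const date = new Date(s * 1000); // back to a Date, as in $let "date"
    const yyyymm = date.getUTCFullYear() * 100 + (date.getUTCMonth() + 1);
    // Like the $reduce step, keep only values not already collected
    if (!months.includes(yyyymm)) {
      months.push(yyyymm);
    }
  }
  return months;
}

// The third sample document: 2017-02-17 to 2017-04-22
console.log(distinctMonths(
  new Date("2017-02-17T00:00:00Z"),
  new Date("2017-04-22T00:00:00Z")
)); // [ 201702, 201703, 201704 ]
```

Stepping by days rather than months avoids calendar arithmetic, at the cost of producing many duplicate month values that then have to be reduced away.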

The result therefore of the first aggregation pipeline stage is a new field in the document which is an "array" of all the distinct months covered between startDate and endDate. This gives an "iterator" for the rest of the operation.

By "iterator" I mean that when we apply $unwind we get a copy of the original document for every distinct month covered in the interval. This allows the following two $group stages to do their work: the first groups on the common key of "month" and "type" in order to "total" the counts via $sum, and the second makes the key just the "type" and puts the results in an array via $push.

This gives the result on the above data:

{
        "_id" : "HGV",
        "monthCounts" : [
                {
                        "month" : 201702,
                        "count" : 2
                },
                {
                        "month" : 201703,
                        "count" : 2
                },
                {
                        "month" : 201704,
                        "count" : 1
                }
        ]
}
{
        "_id" : "CAR",
        "monthCounts" : [
                {
                        "month" : 201701,
                        "count" : 1
                },
                {
                        "month" : 201702,
                        "count" : 1
                }
        ]
}
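
For readers who want to trace how $unwind and the two $group stages arrive at these numbers, the following is a rough plain JavaScript equivalent. It assumes the "range" arrays have already been computed for the three sample documents, and countByTypeAndMonth is a hypothetical name used only for this sketch.

```javascript
// The three sample documents after the first stage, reduced to the
// fields that the later stages actually use
const docs = [
  { type: "CAR", range: [201701, 201702] },
  { type: "HGV", range: [201702, 201703] },
  { type: "HGV", range: [201702, 201703, 201704] }
];

// countByTypeAndMonth is a hypothetical helper for illustration only
function countByTypeAndMonth(docs) {
  // $unwind + first $group: one counter per (type, month) combination
  const byType = {};
  for (const doc of docs) {
    const months = byType[doc.type] || (byType[doc.type] = {});
    for (const month of doc.range) {
      months[month] = (months[month] || 0) + 1;
    }
  }

  // $sort + second $group: one entry per type, months in ascending order
  return Object.keys(byType).sort().map(function (type) {
    const months = byType[type];
    return {
      _id: type,
      monthCounts: Object.keys(months)
        .map(Number)
        .sort(function (a, b) { return a - b; })
        .map(function (m) { return { month: m, count: months[m] }; })
    };
  });
}

console.log(JSON.stringify(countByTypeAndMonth(docs)));
```

Both "HGV" documents cover 201702 and 201703, which is why those months count 2, while only the third document reaches 201704.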

Note that the coverage of "months" is only present where there is actual data. Whilst possible to produce zero values over a range, it requires quite a bit of wrangling to do so and is not very practical. If you want zero values then it is better to add that in post processing in the client once the results have been retrieved.
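
As a sketch of that client-side post processing, the zero values could be filled in along these lines. The helper name fillMissingMonths and the idea of passing the full list of yyyymm values as an argument are assumptions made for illustration; the input shape is the aggregation output shown above.

```javascript
// fillMissingMonths is a hypothetical helper: "results" is the array returned
// by the aggregation, "months" is every yyyymm value the report should show
function fillMissingMonths(results, months) {
  return results.map(function (doc) {
    // Index the counts that the server did return
    const found = new Map(doc.monthCounts.map(function (e) {
      return [e.month, e.count];
    }));
    return {
      _id: doc._id,
      // Emit every requested month, defaulting to zero when absent
      monthCounts: months.map(function (m) {
        return { month: m, count: found.get(m) || 0 };
      })
    };
  });
}

const filled = fillMissingMonths(
  [{ _id: "CAR", monthCounts: [
    { month: 201701, count: 1 },
    { month: 201702, count: 1 }
  ]}],
  [201701, 201702, 201703, 201704]
);
console.log(JSON.stringify(filled[0].monthCounts));
// [{"month":201701,"count":1},{"month":201702,"count":1},{"month":201703,"count":0},{"month":201704,"count":0}]
```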

If you really have your heart set on the zero values, then you should separately query for $min and $max values, and pass these in to "brute force" the pipeline into generating the copies for each supplied possible range value.

So this time the "range" is made externally to all documents, and you then use a $cond expression inside the accumulator to test whether the current document falls within the grouped range value. Since the generation is "external", the MongoDB 3.4 $range operator is no longer needed, so the approach can be applied to earlier versions as well (on those versions $addFields, itself a 3.4 stage, would need to be swapped for an equivalent $project):

// Get min and max separately 
var ranges = db.cars.aggregate([
 { "$group": {
   "_id": null,
   "startRange": { "$min": "$startDate" },
   "endRange": { "$max": "$endDate" }
 }}
]).toArray()[0]

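// Make the range array externally from all possible values
var range = [];
// Start from the first day of the first month (UTC), so that stepping by
// month cannot skip the final month or roll over on days 29-31
var d = new Date(Date.UTC(
  ranges.startRange.getUTCFullYear(), ranges.startRange.getUTCMonth(), 1
));
for ( ; d <= ranges.endRange; d.setUTCMonth(d.getUTCMonth()+1) ) {
  var v = ( d.getUTCFullYear() * 100 ) + d.getUTCMonth()+1;
  range.push(v);
}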

// Run conditional aggregation
db.cars.aggregate([
  { "$addFields": { "range": range } },
  { "$unwind": "$range" },
  { "$group": {
    "_id": {
      "type": "$type",
      "month": "$range"
    },
    "count": { 
      "$sum": {
        "$cond": {
          "if": {
            "$and": [
              { "$gte": [
                "$range",
                { "$add": [
                  { "$multiply": [ { "$year": "$startDate" }, 100 ] },
                  { "$month": "$startDate" }
                ]}
              ]},
              { "$lte": [
                "$range",
                { "$add": [
                  { "$multiply": [ { "$year": "$endDate" }, 100 ] },
                  { "$month": "$endDate" }
                ]}
              ]}
            ]
          },
          "then": 1,
          "else": 0
        }
      }
    }
  }},
  { "$sort": { "_id": 1 } },
  { "$group": {
    "_id": "$_id.type",
    "monthCounts": { 
      "$push": { "month": "$_id.month", "count": "$count" }
    }
  }}
])

Which produces the consistent zero fills for all possible months on all groupings:

{
        "_id" : "HGV",
        "monthCounts" : [
                {
                        "month" : 201701,
                        "count" : 0
                },
                {
                        "month" : 201702,
                        "count" : 2
                },
                {
                        "month" : 201703,
                        "count" : 2
                },
                {
                        "month" : 201704,
                        "count" : 1
                }
        ]
}
{
        "_id" : "CAR",
        "monthCounts" : [
                {
                        "month" : 201701,
                        "count" : 1
                },
                {
                        "month" : 201702,
                        "count" : 1
                },
                {
                        "month" : 201703,
                        "count" : 0
                },
                {
                        "month" : 201704,
                        "count" : 0
                }
        ]
}
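
The test inside $cond is simply a comparison of yyyymm integers, which can be checked outside the database. The following is a minimal sketch in plain JavaScript; toYearMonth and coversMonth are hypothetical names, and UTC getters are used on the assumption that $year and $month are evaluated in UTC, as they are when no timezone is supplied.

```javascript
// Convert a Date to the same yyyymm integer the pipeline builds with
// $year * 100 + $month (hypothetical helper name)
function toYearMonth(date) {
  return date.getUTCFullYear() * 100 + (date.getUTCMonth() + 1);
}

// Returns 1 when the month falls inside the document's start/end months,
// otherwise 0, mirroring the "then"/"else" values summed by $sum
function coversMonth(doc, yyyymm) {
  const inRange =
    yyyymm >= toYearMonth(doc.startDate) &&
    yyyymm <= toYearMonth(doc.endDate);
  return inRange ? 1 : 0;
}

const hgv = {
  startDate: new Date("2017-02-17T00:00:00Z"),
  endDate: new Date("2017-03-22T00:00:00Z"),
  type: "HGV"
};

console.log([201701, 201702, 201703, 201704].map(function (m) {
  return coversMonth(hgv, m);
})); // [ 0, 1, 1, 0 ]
```

Because every document is paired with every month in the external range, each type and month combination always reaches the $group stage, which is what produces the zero entries.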

MapReduce

mapReduce has been available since the early releases of MongoDB (though it is deprecated as of MongoDB 5.0), and the simple case of the "iterator" as mentioned above is handled by a for loop in the mapper. We can get output as generated up to the first $group from above by simply doing:

db.cars.mapReduce(
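  function () {
    // Begin at the first day of the starting month (UTC) rather than mutating
    // this.startDate, so that stepping by month cannot skip the final month
    var d = new Date(Date.UTC(
      this.startDate.getUTCFullYear(), this.startDate.getUTCMonth(), 1
    ));
    for ( ; d <= this.endDate; d.setUTCMonth(d.getUTCMonth()+1) ) {
      // Emit a copy, since "d" keeps changing as the loop advances
      emit({ id: this.type, date: new Date(d.valueOf()) }, 1);
    }
  },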
  function(key,values) {
    return Array.sum(values);
  },
  { "out": { "inline": 1 } }
)

Which produces:

{
        "_id" : {
                "id" : "CAR",
                "date" : ISODate("2017-01-01T00:00:00Z")
        },
        "value" : 1
},
{
        "_id" : {
                "id" : "CAR",
                "date" : ISODate("2017-02-01T00:00:00Z")
        },
        "value" : 1
},
{
        "_id" : {
                "id" : "HGV",
                "date" : ISODate("2017-02-01T00:00:00Z")
        },
        "value" : 2
},
{
        "_id" : {
                "id" : "HGV",
                "date" : ISODate("2017-03-01T00:00:00Z")
        },
        "value" : 2
},
{
        "_id" : {
                "id" : "HGV",
                "date" : ISODate("2017-04-01T00:00:00Z")
        },
        "value" : 1
}

So it does not have the second grouping to compound to arrays, but we did produce the same basic aggregated output.
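
If the array form is wanted, the second grouping can be done in the client once the inline results come back. Here is a small sketch of that step in plain JavaScript; groupByType is a hypothetical name and the input mirrors the shape of the mapReduce output shown above.

```javascript
// Shape of the inline mapReduce results shown above
const results = [
  { _id: { id: "CAR", date: new Date("2017-01-01T00:00:00Z") }, value: 1 },
  { _id: { id: "CAR", date: new Date("2017-02-01T00:00:00Z") }, value: 1 },
  { _id: { id: "HGV", date: new Date("2017-02-01T00:00:00Z") }, value: 2 },
  { _id: { id: "HGV", date: new Date("2017-03-01T00:00:00Z") }, value: 2 },
  { _id: { id: "HGV", date: new Date("2017-04-01T00:00:00Z") }, value: 1 }
];

// groupByType is a hypothetical helper: it collects the per-month values
// under each type, preserving the order in which they were returned
function groupByType(results) {
  const grouped = {};
  for (const r of results) {
    const list = grouped[r._id.id] || (grouped[r._id.id] = []);
    list.push({ date: r._id.date, count: r.value });
  }
  return Object.keys(grouped).map(function (id) {
    return { _id: id, monthCounts: grouped[id] };
  });
}

console.log(JSON.stringify(groupByType(results), null, 2));
```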
