Ahmed Saad

Reputation: 177

Aggregation Timing Out

I am using aggregation to query my schema for counts over date ranges. My problem is that I am not getting any response from the server (it times out every time). Other Mongoose queries work fine (find, save, etc.), and with aggregation it depends on the pipeline: when I only use $match I get a response, but when I add $unwind I don't get any.


Connection Code:

var promise = mongoose.connect('mongodb://<username>:<password>@<db>.mlab.com:<port>/<db-name>', {
  useMongoClient: true,
  replset: {
    ha: true, // Make sure the high availability checks are on
    haInterval: 5000 // Run every 5 seconds
  }
});

promise.then(function(db){
  console.log('DB Connected');
}).catch(function(e){
  console.log('DB Not Connected');
  console.error(e.message);
  process.exit(1);
});

Schema:

var ProspectSchema = new Schema({
  contact_name: {
    type: String,
    required: true
  },
  company_name: {
    type: String,
    required: true
  },
  contact_info: {
    type: Array,
    required: true
  },
  description:{
    type: String,
    required: true
  },
  product:{
    type: Schema.Types.ObjectId, ref: 'Product'
  },
  progression:{
    type: String
  },
  creator:{
    type: String
  },
  sales: {
    type: Schema.Types.ObjectId,
    ref: 'User'
  },
  technical_sales: {
    type: Schema.Types.ObjectId,
    ref: 'User'
  },
  actions: [{
    type: {type: String},
    description: {type: String},
    date: {type: Date}
  }],
  sales_connect_id: {
    type: String
  },
  date_created: {
    type: Date,
    default: Date.now
  }
});

Aggregation code:

exports.getActionsIn = function(start_date, end_date) {
  var start = new Date(start_date);
  var end = new Date(end_date);

  return Prospect.aggregate([
    {
      $match: {
        // "actions": {
        //   $elemMatch: {
        //     "type": {
        //       "$exists": true
        //     }
        //   }
        // }
        "actions.date": {
          $gte: start,
          $lte: end
        }
      }
    }
    ,{
      $project: {
        _id: 0,
        actions: 1
      }
    }
    ,{
      $unwind:  "actions"
    }
    ,{
      $group: {
        _id: "actions.date",
        count: {
          $sum: 1
        }
      }
    }
    // ,{
    //   $project: {
    //     _id: 0,
    //     date: {
    //       $dateToString: {
    //         format: "%d/%m/%Y",
    //         date: "actions.date"
    //       }
    //     }
    //     // ,
    //     // count : "$count"
    //   }
    // }
  ]).exec();
}

Calling the Aggregation:

router.get('/test',function(req, res, next){
  var start_date = req.query.start_date;
  var end_date = req.query.end_date;
  ProspectCont.getActionsIn(start_date,end_date).then(function(value, err){
    if(err)console.log(err);
    res.json(value);
  });
})

My main problem is that I get no response at all. I could work with an error message; the issue is that I am not getting one, so I don't know what is wrong.

Mongoose Version: 4.11.8

P.S. I have tried multiple variations of the aggregation pipeline, so this isn't my first attempt. I have an aggregation working on the main Prospect schema, but not on the actions sub-document.

Upvotes: 0

Views: 556

Answers (1)

Neil Lunn

Reputation: 151122

You have several problems here, mostly stemming from missed concepts. Lazy readers can skip to the bottom for the full pipeline example, but the main body here explains why things are done as they are.

  1. You are trying to select on a date range. The very first thing to check on any long-running operation is that you have a valid index. You might have one, or you might not, but you should issue ( from the shell ):

    db.prospects.createIndex({ "actions.date": 1 })
    

    Just to be sure. You really should also add this to the schema definition so you know the index will be deployed. So add to your defined schema:

    ProspectSchema.index({ "actions.date": 1 })
    
  2. When querying with a "range" on elements of an array, you need to understand that those are "multiple conditions" which you expect to match elements "between". While you can generally get away with querying a "single property" of an array using "Dot Notation", you are missing that the application of $gte and $lte is like specifying the property several times with $and explicitly.

    Whenever you have such "multiple conditions" you almost always mean to use $elemMatch. Without it, you are simply testing every value in the array to see if it is greater than or less than ( where some may be greater and some may be lesser ). The $elemMatch operator makes sure that "both" conditions are applied to the same "element", and not just to any array values as "Dot Notation" exposes them:

    { "$match": {
      "actions": {
        "$elemMatch": { "date": { "$gte": start, "$lte": end } }
      }
    }}
    

    That will now only match documents where the "array elements" fall within the specified date range. Without it, you are selecting and processing a lot more data which is irrelevant to the selection.
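
The difference can be sketched without a database, in plain Node.js, by modelling the two matching behaviours directly (the document and dates here are invented for illustration):

```javascript
// Sketch (plain Node.js, no database): "Dot Notation" range matching vs
// $elemMatch. The document and dates are invented for illustration.

const start = new Date("2017-08-01");
const end = new Date("2017-08-31");

// Dot-notation semantics: each condition may be satisfied by a DIFFERENT element
function dotNotationMatch(doc) {
  return doc.actions.some(a => a.date >= start) &&
         doc.actions.some(a => a.date <= end);
}

// $elemMatch semantics: a SINGLE element must satisfy BOTH conditions
function elemMatchMatch(doc) {
  return doc.actions.some(a => a.date >= start && a.date <= end);
}

// One action before the range and one after it — none actually inside
const doc = {
  actions: [
    { date: new Date("2017-07-01") }, // satisfies $lte only
    { date: new Date("2017-09-15") }  // satisfies $gte only
  ]
};

console.log(dotNotationMatch(doc)); // → true  (document selected anyway)
console.log(elemMatchMatch(doc));   // → false (correctly rejected)
```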

  3. Array Filtering: Marked in Bold because its prominence cannot be ignored. Any initial $match works just like any "query" in that its "job" is to "select documents" valid to the expression. This, however, has no effect on the contents of the array in the documents returned.

    Whenever you have such a condition for document selection, you nearly always intend to "filter" such content from the array itself. This is a separate process, and it really should be performed before any other operations that work with the content. Especially $unwind.

    So you really should add a $filter in either an $addFields or $project, as appropriate to your intent, "immediately" following any document selection:

    { "$project": {
      "_id": 0,
      "actions": {
        "$filter": {
          "input": "$actions",
          "as": "a",
          "cond": {
            "$and": [
              { "$gte": [ "$$a.date", start ] },
              { "$lte": [ "$$a.date", end ] }
            ]
          }
        }
      }
    }}
    

    Now the array content, which you already know "must" have contained at least one valid item due to the initial query conditions, is "reduced" down to only those entries that actually match the date range that you want. This removes a lot of overhead from later processing.

    Note the different "logical variants" of $gte and $lte in use within the $filter condition. These evaluate to return a boolean for expressions that require them.
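
What the $filter stage leaves inside each already-selected document can be sketched with plain Array.prototype.filter (again, the data here is invented for illustration):

```javascript
// Sketch (plain Node.js): the per-document effect of the $filter stage,
// expressed with Array.prototype.filter. Data is invented for illustration.

const start = new Date("2017-08-01");
const end = new Date("2017-08-31");

const doc = {
  actions: [
    { type: "call",  date: new Date("2017-07-20") }, // before the range
    { type: "email", date: new Date("2017-08-10") }, // inside the range
    { type: "visit", date: new Date("2017-09-02") }  // after the range
  ]
};

// $filter keeps only the elements for which the condition evaluates true
const filtered = doc.actions.filter(a => a.date >= start && a.date <= end);

console.log(filtered.map(a => a.type)); // → [ 'email' ]
```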

  4. Grouping: It's probably just an attempt at getting a result, but the code you have does not really do anything with the dates in question. Since typical date values are recorded with millisecond precision, you generally want to reduce them.

    Commented code suggests usage of $dateToString within a $project. It is strongly recommended that you do not do that. If you intend such a reduction, then supply that expression directly to the grouping key within $group instead:

    { "$group": {
      "_id": {
        "$dateToString": {
          "format": "%Y-%m-%d",
          "date": "$actions.date"
        }
      },
      "count": { "$sum": 1 }
    }}
    

    I personally don't like returning a "string" when a natural Date object serializes properly for me already. So I like to use the "math" approach to "round" dates instead:

    { "$group": {
      "_id": {
        "$add": [
          { "$subtract": [
            { "$subtract": [ "$actions.date", new Date(0) ] },
            { "$mod": [
              { "$subtract": [ "$actions.date", new Date(0) ] },
              1000 * 60 * 60 * 24
            ]}
          ]},
          new Date(0)
        ]
      },
      "count": { "$sum": 1 }
    }}
    

    That returns a valid Date object "rounded" to the current day. Mileage may vary on preferred approaches, but it's the one I like. And it takes the least bytes to transfer.

    The usage of new Date(0) represents the "epoch date". So when you $subtract one BSON Date from another you end up with the millisecond difference between the two as an integer. When you $add an integer value to a BSON Date, you get a new BSON Date offset by that many milliseconds. This is the basis of converting to numeric, rounding to the nearest start of day, and then converting the numeric value back to a Date.

    By making that statement directly within the $group rather than in a $project, you save what would otherwise be interpreted as "go through all the data and return this calculated value, then go through it again and accumulate". Much the same as working through a pile of objects, marking them with a pen first and then counting them as a separate step.

    As a single pipeline stage it saves considerable resources, since you do the accumulation at the same time as calculating the value to accumulate on. When you think it through, much like the provided analogy, it just makes a lot of sense.
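
Both grouping variants can be sketched in plain Node.js, with no database involved; the dates below are invented for illustration:

```javascript
// Sketch (plain Node.js, no database): what each $group variant computes
// for a stream of unwound action dates. Dates are invented for illustration.

const oneDay = 1000 * 60 * 60 * 24;

// Variant 1: the $dateToString key — bucket by a "%Y-%m-%d" string (UTC)
function groupByDayString(dates) {
  const counts = {};
  for (const d of dates) {
    const key = d.toISOString().slice(0, 10); // "%Y-%m-%d"
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Variant 2: the epoch-math key — round each Date down to the start of
// its day, mirroring the $subtract / $mod / $add pipeline expressions
function roundToDay(date) {
  const ms = date - new Date(0);       // $subtract: Date minus epoch → integer ms
  return new Date(ms - (ms % oneDay)); // $mod drops the partial day; $add restores a Date
}

const dates = [
  new Date("2017-08-10T09:00:00Z"),
  new Date("2017-08-10T17:30:00Z"),
  new Date("2017-08-11T08:00:00Z")
];

console.log(groupByDayString(dates));            // → { '2017-08-10': 2, '2017-08-11': 1 }
console.log(roundToDay(dates[0]).toISOString()); // → 2017-08-10T00:00:00.000Z
```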


As a full pipeline example you would put the above together as:

Prospect.aggregate([
    { "$match": {
      "actions": {
        "$elemMatch": { "date": { "$gte": start, "$lte": end } }
      }
    }},
    { "$project": {
      "_id": 0,
      "actions": {
        "$filter": {
          "input": "$actions",
          "as": "a",
          "cond": {
            "$and": [
              { "$gte": [ "$$a.date", start ] },
              { "$lte": [ "$$a.date", end ] }
            ]
          }
        }
      }
    }},
    { "$unwind": "$actions" },
    { "$group": {
      "_id": {
        "$dateToString": {
          "format": "%Y-%m-%d",
          "date": "$actions.date"
        }
      },
      "count": { "$sum": 1 }
    }}
])

And honestly, if after making sure an index is in place and following that pipeline you still have timeout problems, then reduce the date selection until you get a reasonable response time.

If it's still taking too long ( or the date reduction is not reasonable ), then your hardware is simply not up to the task. If you really have a lot of data then you have to be reasonable with expectations. So scale up or scale out, but those concerns are outside the scope of any question here.

As it stands, those improvements should make a significant difference over any attempt shown so far, mostly due to a few fundamental concepts that were being missed.

Upvotes: 1
