Christian Dechery
Christian Dechery

Reputation: 880

Strange results with MongoDB group aggregation

I have two different documents I would like to group with a matching field (idOrdem in this case). Both has a timestamp filed and only one has a timeElapsed field. I need all of this information to be returned in the aggregation, but the result I'm getting doesn't seem to be correct. I get null results for the timeElapsed field where there is definetly a document with this field.

What is wrong with my statement?

db.Logging.aggregate( [
 { $match : { $or : [ { "action":"resetDslVerify"}, { "action":"assia/reset/RequestQueryOs" } ] } },
 { $group : {
    _id : "$idOrdem",
    timestamp1: { $first: '$timestamp' },
    timestamp2: { $last: '$timestamp' },
    timeElapsed: { $first: '$timeElapsed' }
    }
 },
 { $sort: { timestamp: -1}  } ]
);

The unexpected result:

{ "_id" : "159251", "timestamp1" : 1483456382058, "timestamp2" : 1483456382058, "timeElapsed" : 1091 }
{ "_id" : "134601", "timestamp1" : 1482949316671, "timestamp2" : 1482949349410, "timeElapsed" : 821 }
{ "_id" : "168801", "timestamp1" : 1483560599899, "timestamp2" : 1483560564505, "timeElapsed" : null }
{ "_id" : "158901", "timestamp1" : 1483452698756, "timestamp2" : 1483452673424, "timeElapsed" : null }
{ "_id" : "135001", "timestamp1" : 1482949653229, "timestamp2" : 1482949711541, "timeElapsed" : 838 }

An example of documents with matching idOrdem with all the information I need:

s-1:PRIMARY> db.Logging.find( { $or : [ { "action":"resetDslVerify"}, { "action":"assia/reset/RequestQueryOs" } ], "idOrdem":"135001" } );
{ "_id" : ObjectId("586404155b88db1209c3f998"), "success" : true, "action" : "assia/reset/RequestQueryOs", "timestamp" : 1482949653229, "httpCode" : 200, "timeElapsed" : 838, "idOrdem" : "135001", "creator" : "TecnicoVirtual" }
{ "_id" : ObjectId("5864044f5b88db1209c3f99b"), "success" : true, "action" : "resetDslVerify", "timestamp" : 1482949711541, "terminal" : "2134599099", "httpCode" : 200, "idOrdem" : "135001", "idOrdem" : "135001", "result" : "OK", "timestamp" : 1482949711541, "isResetDslSuccess" : true, "creator" : "TecnicoVirtual" }

Upvotes: 0

Views: 207

Answers (2)

chridam
chridam

Reputation: 103475

The $first and $last accumulators return a value from the first/last document for each group respectively. Order is only defined if the documents are in a defined order but in your case you are ordering them AFTER grouping them in arbitrary order, which renders the accumulators useless as they only return the documents in that undefined order, hence getting the strange results.

To debug the pipeline, run it cumulatively adding each step and inspect the resulting documents at each step. So for example, you could start with just the $match pipeline and verify if the results at that stage only are the expected:

db.Logging.aggregate([
    { 
        "$match": {
            "action": { 
                "$in": [
                    "resetDslVerify",
                    "assia/reset/RequestQueryOs"
                ]
            }
        }
    }
])

Observe the resulting documents at this stage which should be documents that match the given criteria (shortened to use the $in operator in your case) and arbitrary ordered.

Add the next pipeline step:

db.Logging.aggregate([
    { 
        "$match": {
            "action": { 
                "$in": [
                    "resetDslVerify",
                    "assia/reset/RequestQueryOs"
                ]
            }
        }
    },
    {
        "$group": {
            "_id": "$idOrdem",
            "timestamp1": { "$first": "$timestamp" }
            "timestamp2": { "$last": "$timestamp" }
            "timeElapsed": { "$first": "$timeElapsed" }
        }
    }
])

Now this is where things get interesting, the documents from the resulting pipeline are in arbitrary order because they entered the $group phase in that order. Even placing a final $sort pipeline step will not remove the fly in the celebration ointment: it won't alter the order of the original documents, just the order of the GROUP.


The solution is to filter out the nulls in the $match step, place the $sort operator before the $group pipeline, as well as adding the other sort field which will be your group by key and the timeElapsed attribute:

db.Logging.aggregate([
    { 
        "$match": {
            "action": { 
                "$in": [
                    "resetDslVerify",
                    "assia/reset/RequestQueryOs"
                ]
            },
            "timestamp": { "$ne": null },
            "timeElapsed": { "$ne": null }
        }
    },
    { "$sort": { "idOrdem": 1, "timestamp": -1, "timeElapsed": -1 } }
    {
        "$group": {
            "_id": "$idOrdem",
            "timestamp1": { "$first": "$timestamp" }
            "timestamp2": { "$last": "$timestamp" }
            "timeElapsed": { "$first": "$timeElapsed" }
        }
    }
])

A better approach which doesn't use the $sort pipeline would be to use the $max and $min operators:

db.Logging.aggregate([
    { 
        "$match": {
            "action": { 
                "$in": [
                    "resetDslVerify",
                    "assia/reset/RequestQueryOs"
                ]
            }
        }
    },
    {
        "$group": {
            "_id": "$idOrdem",
            "timestamp1": { "$max": "$timestamp" }
            "timestamp2": { "$min": "$timestamp" }
            "timeElapsed": { "$max": "$timeElapsed" }
        }
    }
])

Upvotes: 1

felix
felix

Reputation: 9295

$first will take the value for the first document even if the field doesn't exist fr this doc. You can fix this issue like this:

db.Logging.aggregate([
   {
      $match:{
         $or:[
            {
               "action":"resetDslVerify"
            },
            {
               "action":"assia/reset/RequestQueryOs"
            }
         ]
      }
   },
   {
      $group:{
         _id:"$idOrdem",
         timestamp1:{
            $first:'$timestamp'
         },
         timestamp2:{
            $last:'$timestamp'
         },
         timeElapsed:{
            $push:"$timeElapsed"
         }
      }
   },
   {
      $project:{
         _id:1,
         timestamp1:1,
         timestamp2:1,
         timeElapsed:{
            $arrayElemAt:[
               "$timeElapsed",
               0
            ]
         }
      }
   },
   {
      $sort:{
         timestamp:-1
      }
   }
]);

output: (for the data you provided)

{ "_id" : "135901", "timestamp1" : 1482950884849, "timestamp2" : 1482950907877, "timeElapsed" : 801 }

Upvotes: 1

Related Questions