MongoDB: find subdocument instances with similar date

Question

I have a mongodb collection with a structure like

[
  {
    name: "name1",
    instances: [{value:1, score:2, date:},
                {value:2, score:5, date:},
                {value:2.5, score:9, date:},
                ...]
  },
  {
    name: "name2",
    instances: [{value:6, score:3, date:},
                {value:1, score:6, date:},
                {value:3.7, score:5.2, date:},
                ...]
  },
  ...
]

I want to find if there are two (or more) instances of the same name where the instance's dates are from the same day, and return those instances.

Later on I would like to delete all but one of those instances, but as a start I want to be able to find them.

I tried aggregation and grouping by date, but couldn't figure out how to compare only the day (not the entire date).

chridam · Accepted Answer

Suppose you have the following test documents inserted in the test collection for demonstration purposes:

db.test.insert([
{
    "name" : "name1",
    "instances" : [ 
        {
            "value" : 1,
            "score" : 2,
            "date" : ISODate("2015-03-04T00:00:00.000Z")
        }, 
        {
            "value" : 2,
            "score" : 5,
            "date" : ISODate("2015-04-01T00:00:00.000Z")
        }, 
        {
            "value" : 2.5,
            "score" : 9,
            "date" : ISODate("2015-03-05T00:00:00.000Z")
        }
    ]
},
{
    "name" : "name2",
    "instances" : [ 
        {
            "value" : 6,
            "score" : 3,
            "date" : ISODate("2015-03-05T00:00:00.000Z")
        }, 
        {
            "value" : 1,
            "score" : 6,
            "date" : ISODate("2015-03-04T00:00:00.000Z")
        }, 
        {
            "value" : 3.7,
            "score" : 5.2,
            "date" : ISODate("2015-02-04T00:00:00.000Z")
        }
    ]
},
{
    "name" : "name1",
    "instances" : [ 
        {
            "value" : 6,
            "score" : 3,
            "date" : ISODate("2015-03-05T00:00:00.000Z")
        }, 
        {
            "value" : 1,
            "score" : 6,
            "date" : ISODate("2015-03-04T00:00:00.000Z")
        }, 
        {
            "value" : 3.7,
            "score" : 5.2,
            "date" : ISODate("2015-02-04T00:00:00.000Z")
        }
    ]
}
])

then the following aggregation would do the job:

var pipeline = aggregate([
    {
        "$unwind": "$instances"
    },
    {
        "$group": {
            "_id": {
                "name": "$name",
                "year": {
                    "$year": "$instances.date"
                },
                "month": {
                    "$month": "$instances.date"
                },
                "day": {
                    "$dayOfYear": "$instances.date"
                }
            },
            "count": {
                "$sum": 1
            },
            "data": {
                "$addToSet": "$$ROOT"
            }
        }
    },
    {
        "$match": {
            "count": {
                "$gt": 1
            }
        }
    },
    {
        "$unwind": "$data"
    },
    {
        "$group": {
            "_id": {
                "name": "$data.name",
                "_id": "$data._id"
            }
        }
    },
    {
        "$project": {
            "_id": "$_id._id",
            "name": "$_id.name"
        }
    }
]);
db.test.aggregate(pipeline);

Output:

/* 0 */
{
    "result" : [ 
        {
            "_id" : ObjectId("55506d0a180e849972939056"),
            "name" : "name1"
        }, 
        {
            "_id" : ObjectId("55506d0a180e849972939058"),
            "name" : "name1"
        }
    ],
    "ok" : 1
}

The above aggregation pipeline has an $unwind operation as the first step which deconstructs the instances array field from the input documents to output a document for each element. Each output document replaces the array with an element value.

The next pipeline stage $group groups the documents by the "name", "instances.date" fields (the date field is split into three fields by using the Date Aggregation Operators), calculates the count field for each group, and outputs a document for each unique name and date (down to the day part). There is an extra array field in the group data, which uses the system variable $$ROOT to store the original root document, i.e. the top-level document, currently being processed in the aggregation pipeline stage. This root document is added to the array by using the $addToSet array operator.

Further down the pipeline you will then need to filter those documents which are duplicates when grouped by name and date through the use of a $match pipeline with the specified criteria that the count should be greater than one.

Another $unwind operation is then applied on the data field to extract the actual _id and name of the duplicates which would be grouped again to further streamline your documents.

An additional $project pipeline stage would be necessary to shape your final documents structure by modifying the fields.

Use the aggregation result cursor to then iterate over the results using the forEach() method and remove the other duplicate documents:

var cur = db.test.aggregate(pipeline);
cur.forEach(function (doc){
    var count = 0;
    if (count != 0){
        db.test.remove({"_id": doc._id});
    }
    count++;
});

Another option is to include an $out operator as a final pipeline stage that writes the documents returned by the aggregation pipeline to a specified collection that you can then query on and do the deletions:

var cur = db.outputcollection.find();
cur.forEach(function (doc){
    var count = 0;
    if (count != 0){
        db.test.remove({"_id": doc._id});
    }
    count++;
});

MongoDB: find subdocument instances with similar date

Answers (2)

Related Questions