Reputation: 5357
I have a mongodb collection with a structure like
[
{
name: "name1",
instances: [{value:1, score:2, date:<ISODate>},
{value:2, score:5, date:<ISODate>},
{value:2.5, score:9, date:<ISODate>},
...]
},
{
name: "name2",
instances: [{value:6, score:3, date:<ISODate>},
{value:1, score:6, date:<ISODate>},
{value:3.7, score:5.2, date:<ISODate>},
...]
},
...
]
I want to find if there are two (or more) instances of the same name
where the instance's dates are from the same day, and return those instances.
Later on I would like to delete all but one of those instances, but as a start I want to be able to find them.
I tried aggregation and grouping by date, but couldn't figure out how to compare only the day (not the entire date).
Upvotes: 1
Views: 266
Reputation: 103375
Suppose you have the following test documents inserted in the test collection for demonstration purposes:
db.test.insert([
{
"name" : "name1",
"instances" : [
{
"value" : 1,
"score" : 2,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 2,
"score" : 5,
"date" : ISODate("2015-04-01T00:00:00.000Z")
},
{
"value" : 2.5,
"score" : 9,
"date" : ISODate("2015-03-05T00:00:00.000Z")
}
]
},
{
"name" : "name2",
"instances" : [
{
"value" : 6,
"score" : 3,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
{
"value" : 1,
"score" : 6,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 3.7,
"score" : 5.2,
"date" : ISODate("2015-02-04T00:00:00.000Z")
}
]
},
{
"name" : "name1",
"instances" : [
{
"value" : 6,
"score" : 3,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
{
"value" : 1,
"score" : 6,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 3.7,
"score" : 5.2,
"date" : ISODate("2015-02-04T00:00:00.000Z")
}
]
}
])
then the following aggregation would do the job:
var pipeline = aggregate([
{
"$unwind": "$instances"
},
{
"$group": {
"_id": {
"name": "$name",
"year": {
"$year": "$instances.date"
},
"month": {
"$month": "$instances.date"
},
"day": {
"$dayOfYear": "$instances.date"
}
},
"count": {
"$sum": 1
},
"data": {
"$addToSet": "$$ROOT"
}
}
},
{
"$match": {
"count": {
"$gt": 1
}
}
},
{
"$unwind": "$data"
},
{
"$group": {
"_id": {
"name": "$data.name",
"_id": "$data._id"
}
}
},
{
"$project": {
"_id": "$_id._id",
"name": "$_id.name"
}
}
]);
db.test.aggregate(pipeline);
Output:
/* 0 */
{
"result" : [
{
"_id" : ObjectId("55506d0a180e849972939056"),
"name" : "name1"
},
{
"_id" : ObjectId("55506d0a180e849972939058"),
"name" : "name1"
}
],
"ok" : 1
}
The above aggregation pipeline has an $unwind
operation as the first step which deconstructs the instances
array field from the input documents to output a document for each element. Each output document replaces the array with an element value.
The next pipeline stage $group
groups the documents by the "name"
, "instances.date"
fields (the date field is split into three fields by using the Date Aggregation Operators), calculates the count
field for each group, and outputs a document for each unique name
and date
(down to the day part). There is an extra array field in the group data
, which uses the system variable $$ROOT
to store the original root document, i.e. the top-level document, currently being processed in the aggregation pipeline stage. This root document is added to the array by using the $addToSet
array operator.
Further down the pipeline you will then need to filter those documents which are duplicates when grouped by name and date through the use of a $match
pipeline with the specified criteria that the count should be greater than one.
Another $unwind
operation is then applied on the data
field to extract the actual _id
and name
of the duplicates which would be grouped again to further streamline your documents.
An additional $project
pipeline stage would be necessary to shape your final documents structure by modifying the fields.
Use the aggregation result cursor to then iterate over the results using the forEach()
method and remove the other duplicate documents:
var cur = db.test.aggregate(pipeline);
cur.forEach(function (doc){
var count = 0;
if (count != 0){
db.test.remove({"_id": doc._id});
}
count++;
});
Another option is to include an $out
operator as a final pipeline stage that writes the documents returned by the aggregation pipeline to a specified collection that you can then query on and do the deletions:
var cur = db.outputcollection.find();
cur.forEach(function (doc){
var count = 0;
if (count != 0){
db.test.remove({"_id": doc._id});
}
count++;
});
Upvotes: 4
Reputation: 52000
If I understand it well you should $unwind
and then $group
by date and instance, filtering out groups of only one document. Something like that (I don't have access to MongoDB right now -- beware of the typos):
db.coll.aggregate([
{$unwind: "$instances"},
{$group: { _id: { name:"$name", day:{$dayOfYear:"$date"}, year:{$year:"$date"}}, count: {$sum: 1} }},
{$match: {count: {$gt: 1}}}
])
Upvotes: 3