Reputation: 165
I'm trying to find duplicates in my sharded collection based on the id field, which has this structure:
"id" : {
    "idInner" : {
        "k1" : "v1",
        "k2" : "v2",
        "k3" : "v3",
        "k4" : "v4"
    }
}
I used the query below, but received the error "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.", even though I passed "allowDiskUse : true" in my query.
db.collection.aggregate([
    { $group: {
        _id: { id: "$id" },
        uniqueIds: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    { $match: {
        count: { $gte: 2 }
    } },
    { $sort : { count : -1 } },
    { $limit : 10 }
],
{
    allowDiskUse : true
});
Is there another way to get what I want, or something else I should pass in the above query? Thanks.
Upvotes: 3
Views: 5075
Reputation: 3212
Please pass allowDiskUse: true through runCommand instead:
db.runCommand(
    { aggregate: "collection",
      pipeline: [
          { $group: {
              _id: { id: "$id" },
              uniqueIds: { $addToSet: "$_id" },
              count: { $sum: 1 }
          } },
          { $match: {
              count: { $gte: 2 }
          } },
          { $sort : { count : -1 } },
          { $limit : 10 }
      ],
      allowDiskUse: true
    }
)
Let me know if this works for you.
Upvotes: 3
Reputation: 1511
Run a $match first in the pipeline to keep only documents whose id.idInner.k1, say, falls within a given range, so that you get results for that range only. Since you are interested in duplicates on the id key, all copies of a duplicated id share the same k1 value and therefore fall into the same range. See how far you need to narrow the range, then run the pipeline again for the next range, and so on, until you have covered all documents.
If this is something you must do frequently, automate it: declare the ranges, feed them into a loop, keep the duplicates from every run, and merge the results at the end.
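The range loop described above might be sketched roughly like this. The helper name buildRangePipelines and the boundary values are hypothetical, chosen only for illustration; the $group and $match stages are taken from the question. Each range includes its lower boundary and excludes its upper one, so values below the first boundary or at or above the last one are not covered unless you add boundaries for them.

```javascript
// Hypothetical helper (not part of MongoDB): builds one duplicate-finding
// pipeline per consecutive pair of boundaries on id.idInner.k1.
// Each range is [lower, upper): lower inclusive, upper exclusive.
function buildRangePipelines(boundaries) {
  var pipelines = [];
  for (var i = 0; i + 1 < boundaries.length; i++) {
    pipelines.push([
      // Narrow the input first so $group sees fewer documents.
      { $match: { "id.idInner.k1": { $gte: boundaries[i], $lt: boundaries[i + 1] } } },
      // Same grouping as in the question.
      { $group: { _id: { id: "$id" }, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
      // Keep only ids that occur more than once.
      { $match: { count: { $gte: 2 } } }
    ]);
  }
  return pipelines;
}

// Usage in the mongo shell (assumed boundaries; adjust to your data):
// var duplicates = [];
// buildRangePipelines(["a", "h", "p", "z"]).forEach(function (pipeline) {
//   var batch = db.collection.aggregate(pipeline, { allowDiskUse: true }).toArray();
//   duplicates = duplicates.concat(batch);
// });
```

Sorting by count and taking the top 10, as in the original query, would then be done on the merged duplicates array rather than inside each per-range pipeline.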
Another quick hack would be to bypass the mongos and run the aggregation directly on each shard. Doing so limits the input to roughly docs/number_of_shards (assuming well-balanced shards), which may keep you under the memory limit. This second approach assumes that your shard key is the id key; if it is not, it will not work, because copies of the same duplicated document may be scattered across the shards.
Upvotes: 3