user3723491

Reputation: 165

mongodb - aggregate failed with memory error

I'm trying to find duplicates in my sharded collection using the id field, which has this structure:

"id" : {
        "idInner" : {
            "k1" : "v1",
            "k2" : "v2",
            "k3" : "v3",
            "k4" : "v4"
        }
}

I used the below query, but received the "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." error, even though I used "allowDiskUse : true" in my query.

db.collection.aggregate([
  { $group: {
    _id: { id: "$id" },
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
], 
{ 
    allowDiskUse : true
});

Is there another way to get what I want, or something else I should pass in the above query? Thanks.

Upvotes: 3

Views: 5075

Answers (2)

Ajay Gupta

Reputation: 3212

Please pass allowDiskUse: true via db.runCommand:

db.runCommand(
   { aggregate: "collection",
     pipeline: [
  { $group: {
    _id: { id: "$id" },
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
],
     allowDiskUse: true
   }
)

Let me know if this works for you.
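
Note: on newer MongoDB versions (3.6 and later, if I recall correctly) the aggregate command also expects an explicit cursor document, so you may need to add one:

db.runCommand(
   { aggregate: "collection",
     pipeline: [ /* same pipeline as above */ ],
     allowDiskUse: true,
     cursor: { }
   }
)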

Upvotes: 3

Ioannis Lalopoulos

Reputation: 1511

Run a $match first in the pipeline to keep only documents where, say, id.idInner.k1 falls within a certain range, so that you get results for that range only. Since you are interested in duplicates on the id key, every duplicated document will still satisfy this criterion. See how much you need to narrow the range, then run it again for the next range, and so on until you have covered all documents.

If this is something you must do frequently, automate it: declare the ranges, feed them into a loop, keep the duplicates from each run, and merge the results at the end.
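
A rough sketch of that loop in the mongo shell (the field id.idInner.k1 and the range boundaries below are placeholders; pick boundaries that actually split your data into small enough chunks):

// placeholder ranges over a string field; adjust to your data
var ranges = [
    { min: "",  max: "h" },
    { min: "h", max: "p" },
    { min: "p", max: "~" }
];

var duplicates = [];
ranges.forEach(function (r) {
    db.collection.aggregate([
        { $match: { "id.idInner.k1": { $gte: r.min, $lt: r.max } } },
        { $group: {
            _id: { id: "$id" },
            uniqueIds: { $addToSet: "$_id" },
            count: { $sum: 1 }
        } },
        { $match: { count: { $gte: 2 } } }
    ], { allowDiskUse: true }).forEach(function (doc) {
        duplicates.push(doc);
    });
});

// duplicates now holds the duplicate groups found across all ranges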

Another quick hack/trick would be to bypass the mongos and run the aggregation directly on each shard. Doing so limits the documents roughly to docs/number_of_shards (assuming well-balanced shards), which may keep you under the memory limit. This second approach assumes that your shard key is the id key; if it is not, it will not work, because the same duplicated documents will be scattered across the shards.
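
A minimal sketch of that second approach from the mongo shell, assuming hypothetical shard addresses (list the real ones with sh.status() on the mongos) and a database called mydb:

var shardConn = new Mongo("shard0.example.com:27018");   // hypothetical shard address
var shardDb = shardConn.getDB("mydb");                   // replace with your database name

shardDb.collection.aggregate([
    { $group: {
        _id: { id: "$id" },
        uniqueIds: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    { $match: { count: { $gte: 2 } } },
    { $sort: { count: -1 } },
    { $limit: 10 }
], { allowDiskUse: true });

// repeat for each remaining shard and merge the results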

Upvotes: 3
