Reputation: 330
I have a collection with a field called "contact_id". The collection contains duplicate records that share the same value for this key.
How can I remove the duplicates so that just one record remains for each value?
I already tried:
db.PersonDuplicate.ensureIndex({"contact_id": 1}, {unique: true, dropDups: true})
But it did not work, because the dropDups option is no longer available in MongoDB 3.x.
I'm using 3.2.
Upvotes: 11
Views: 17540
Reputation: 61646
We can also use an $out stage to remove duplicates from a collection by replacing the content of the collection with only one occurrence per duplicate.
For instance, to only keep one element per value of x:
// > db.collection.find()
// { "x" : "a", "y" : 27 }
// { "x" : "a", "y" : 4 }
// { "x" : "b", "y" : 12 }
db.collection.aggregate([
  { $group: { _id: "$x", onlyOne: { $first: "$$ROOT" } } },
  { $replaceWith: "$onlyOne" }, // prior to 4.2: { $replaceRoot: { newRoot: "$onlyOne" } }
  { $out: "collection" }
])
// > db.collection.find()
// { "x" : "a", "y" : 27 }
// { "x" : "b", "y" : 12 }
This pipeline works in three stages.
$group groups documents by the field that defines what a duplicate is (here x) and, for each group, keeps only one document (the $first one found) by storing $$ROOT, which is the document itself. At the end of this stage, we have something like:
{ "_id" : "a", "onlyOne" : { "x" : "a", "y" : 27 } }
{ "_id" : "b", "onlyOne" : { "x" : "b", "y" : 12 } }
$replaceWith replaces all existing fields in the input document with the content of the onlyOne field we created in the $group stage, in order to get back the original format. At the end of this stage, we have something like:
{ "x" : "a", "y" : 27 }
{ "x" : "b", "y" : 12 }
$replaceWith is only available starting in MongoDB 4.2. With prior versions (3.4 and later), we can use $replaceRoot instead:
{ $replaceRoot: { newRoot: "$onlyOne" } }
$out writes the result of the aggregation pipeline to the same collection. Note that $out conveniently replaces the entire content of the specified collection, which is what makes this solution possible.
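To make the stage-by-stage description concrete, here is a minimal in-memory sketch of what the pipeline computes. It is plain JavaScript, not MongoDB code, and keepFirstPerKey is a hypothetical helper name used only for illustration. It assumes "first" means first in array order, whereas the server makes no ordering promise for $first unless the documents are sorted beforehand.

```javascript
// In-memory model of the pipeline above; plain JavaScript, not MongoDB code.
// keepFirstPerKey is a hypothetical helper name used only for illustration.
function keepFirstPerKey(docs, key) {
  // Plays the role of $group with onlyOne: { $first: "$$ROOT" }
  const firstSeen = new Map();
  for (const doc of docs) {
    if (!firstSeen.has(doc[key])) {
      firstSeen.set(doc[key], doc);
    }
  }
  // Plays the role of $replaceWith: hand back the kept documents themselves
  return Array.from(firstSeen.values());
}

const docs = [
  { x: "a", y: 27 },
  { x: "a", y: 4 },
  { x: "b", y: 12 },
];
const deduped = keepFirstPerKey(docs, "x");
console.log(JSON.stringify(deduped));
```

Reassigning the result to the same variable would correspond to the $out stage overwriting the collection.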
Upvotes: 3
Reputation: 2036
I have used this approach:
Upvotes: 0
Reputation: 745
This is a good pattern for MongoDB 3+ that also ensures you will not run out of memory, which can happen with really big collections. You can save this to a dedup.js file, customize it, and run it against your desired database with: mongo localhost:27017/YOURDB dedup.js
var duplicates = [];
// Group by the field that defines a duplicate and collect the _ids of each group.
// allowDiskUse lets the $group stage spill to disk on very large collections.
db.YOURCOLLECTION.aggregate(
  [
    { $group: { _id: { DUPEFIELD: "$DUPEFIELD" }, dups: { $addToSet: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
  ],
  { allowDiskUse: true }
).forEach(function (doc) {
  doc.dups.shift(); // keep one _id per group
  doc.dups.forEach(function (dupId) { duplicates.push(dupId); });
});
printjson(duplicates); // optional: print the list of duplicates to be removed
db.YOURCOLLECTION.remove({ _id: { $in: duplicates } });
Upvotes: 5
Reputation: 8978
Yes, dropDups is gone for good. But you can definitely achieve your goal with a little effort.
You first need to find all duplicate documents and then remove all but the first one in each group.
db.dups.aggregate([{$group:{_id:"$contact_id", dups:{$push:"$_id"}, count: {$sum: 1}}},
{$match:{count: {$gt: 1}}}
]).forEach(function(doc){
doc.dups.shift();
db.dups.remove({_id : {$in: doc.dups}});
});
As you can see, doc.dups.shift() removes the first _id from the array, and the remove() call then deletes all documents whose _ids remain in the dups array.
The script above will remove all duplicate documents.
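The same group, shift, remove logic can be traced on a plain array to see which _ids end up being deleted. This is a minimal in-memory sketch, not MongoDB code; findDuplicateIds and the sample people array are hypothetical and exist only for illustration.

```javascript
// In-memory model of the group / shift / remove steps; not MongoDB code.
// findDuplicateIds is a hypothetical helper name used only for illustration.
function findDuplicateIds(docs, key) {
  // Like $group with dups: { $push: "$_id" }
  const idsByKey = new Map();
  for (const doc of docs) {
    if (!idsByKey.has(doc[key])) {
      idsByKey.set(doc[key], []);
    }
    idsByKey.get(doc[key]).push(doc._id);
  }
  const toRemove = [];
  for (const ids of idsByKey.values()) {
    if (ids.length > 1) {     // like $match: { count: { $gt: 1 } }
      ids.shift();            // keep the first _id of the group
      toRemove.push(...ids);  // every remaining _id is a duplicate
    }
  }
  return toRemove;
}

// Hypothetical sample data
const people = [
  { _id: 1, contact_id: "c1" },
  { _id: 2, contact_id: "c2" },
  { _id: 3, contact_id: "c1" },
  { _id: 4, contact_id: "c1" },
];
const duplicateIds = findDuplicateIds(people, "contact_id");
// Stand-in for remove({ _id: { $in: duplicateIds } })
const remaining = people.filter((doc) => !duplicateIds.includes(doc._id));
console.log(duplicateIds, remaining.map((doc) => doc._id));
```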
Upvotes: 30
Reputation: 9473
Maybe it would be worth trying to create a temporary collection, create a unique index on it, copy the data over from the source, and as a last step swap the names?
Another idea I had is to collect the duplicated keys into an array (using aggregation) and then loop through it, calling the remove() method with the justOne parameter set to true or 1.
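var itemsToDelete = db.PersonDuplicate.aggregate([
  // count the documents for each contact_id
  {$group: { _id:"$contact_id", count:{$sum:1}}},
  // keep only the contact_ids that occur more than once
  {$match: {count: {$gt:1}}},
  // gather those contact_ids into a single ids array
  {$group: { _id:1, ids:{$addToSet:"$_id"}}}
])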
and then loop through the ids array. Does this make sense for you?
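The first idea (copying into a temporary collection that has a unique index) can be modelled in memory to show why only one document per key survives. This is a rough sketch in plain JavaScript, not MongoDB code; UniqueCollection and copyWithoutDuplicates are hypothetical names. A real server would reject the second insert with a duplicate key error rather than return false, so a real copy loop would need to catch and skip that error.

```javascript
// In-memory model of "copy into a collection that has a unique index".
// UniqueCollection is a hypothetical class, not a MongoDB API.
class UniqueCollection {
  constructor(uniqueField) {
    this.uniqueField = uniqueField;
    this.byKey = new Map(); // stands in for the unique index
  }

  // Returns true if the document was stored, false if its key already exists
  insert(doc) {
    const keyValue = doc[this.uniqueField];
    if (this.byKey.has(keyValue)) {
      return false;
    }
    this.byKey.set(keyValue, doc);
    return true;
  }

  documents() {
    return Array.from(this.byKey.values());
  }
}

// Copies every source document into a fresh UniqueCollection and
// counts how many were rejected as duplicates.
function copyWithoutDuplicates(source, uniqueField) {
  const tmp = new UniqueCollection(uniqueField);
  let rejected = 0;
  for (const doc of source) {
    if (!tmp.insert(doc)) {
      rejected += 1;
    }
  }
  return { docs: tmp.documents(), rejected: rejected };
}

// Hypothetical sample data
const source = [
  { _id: 1, contact_id: "c1" },
  { _id: 2, contact_id: "c1" },
  { _id: 3, contact_id: "c2" },
];
const result = copyWithoutDuplicates(source, "contact_id");
console.log(result.docs.length, result.rejected);
```

Swapping the names afterwards would correspond to replacing the source array with result.docs.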
Upvotes: 0