Reputation: 96
I have a collection in Mongo with documents duplicated on a specific key, and I need to remove all but one document from each set of duplicates. The Map Reduce solutions I have found don't make it clear how to do that. I am using Ruby; how can I do this somewhat efficiently? My current solution is unbelievably slow!
I currently just iterate over an array of the duplicate key values and delete the first document returned for each, but this only works if there is at most one duplicate document per key, and it is really slow.
# For each duplicated key value, fetch one matching document and remove it.
# This makes two round trips per key and removes only a single copy.
dupes.each do |key|
  $mongodb.collection("some_collection").remove($mongodb.collection("some_collection").find({key: key}).first)
end
Upvotes: 1
Views: 1640
Reputation: 293
I think you should use MongoDB's ensureIndex() to remove the duplicates. For instance, in your case, if you want to drop the documents duplicated on the key duplicate_key, you can do:
db.duplicate_collection.ensureIndex({'duplicate_key' : 1},{unique: true, dropDups: true})
where duplicate_collection is the collection that holds your duplicate documents. This operation will preserve only a single document for each value of the given key. (Note: the dropDups option was removed in MongoDB 3.0, so this only works on older servers.)
After the operation, if you want to remove the index, just run the dropIndex operation. For details, see the MongoDB documentation.
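From Ruby, something like the following should be equivalent. This is a minimal sketch assuming the legacy mongo 1.x driver; the ensure_index/drop_index calls exist there, but the :drop_dups option name is my assumption for that driver version:
coll = $mongodb.collection("duplicate_collection")
# Build a unique index, discarding all but one document per duplicate_key value.
coll.ensure_index({"duplicate_key" => 1}, unique: true, drop_dups: true)
# Later, if the unique index is no longer wanted:
coll.drop_index("duplicate_key_1")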
Upvotes: 2
Reputation: 96
A lot of solutions suggest Map Reduce (which is fast and fine), but I implemented a solution in Ruby that seems pretty fast as well and makes it easy to keep one document from each duplicate set.
Basically, you find all your duplicate keys by adding them to a hash; any time you see a key that is already in the hash, you add that document's id to an array, which you then use for a single bulk removal at the end.
all_keys = {}
dupes = []
dupe_key = "some_key"
$mongodb.collection("some_collection").find.each do |doc|
  # The first document seen for each key value is kept; every later one is a dupe.
  if all_keys.key?(doc[dupe_key])
    dupes << doc["_id"]
  else
    all_keys[doc[dupe_key]] = 1
  end
end
# Remove every duplicate in one bulk operation.
$mongodb.collection("some_collection").remove({_id: {"$in" => dupes}})
The only issue with this method is that it potentially won't work if the total set of keys and duplicate ids can't be held in memory. A map-reduce or server-side aggregation approach would probably be best at that point.
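For reference, here is a sketch of one server-side alternative using the aggregation framework rather than map reduce (my assumptions: MongoDB 2.2+ and the legacy driver's aggregate method; each group document must also stay under the 16MB BSON limit):
coll = $mongodb.collection("some_collection")
# Group on the key, collect the _ids per value, and keep only groups with dupes.
groups = coll.aggregate([
  {"$group" => {"_id" => "$some_key", "ids" => {"$push" => "$_id"}, "count" => {"$sum" => 1}}},
  {"$match" => {"count" => {"$gt" => 1}}}
])
groups.each do |group|
  extra_ids = group["ids"].drop(1)  # keep the first document in each group
  coll.remove({"_id" => {"$in" => extra_ids}})
end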
Upvotes: 0