Musicalmindz

Reputation: 96

How do I find and remove duplicate MongoDB documents with Ruby?

I have a MongoDB collection in which several documents share the same value for a specific key, and I need to remove all but one document from each duplicate set. The Map Reduce solutions I have seen don't make it clear how to remove all but one of the duplicates. I am using Ruby; how can I do this reasonably efficiently? My current solution is unbelievably slow!

I currently just iterate over an array of the duplicate keys and delete the first document returned for each one, but this only works if each key has at most one extra document, and it is really slow.

dupes.each do |key|
    $mongodb.collection("some_collection").remove($mongodb.collection("some_collection").find({key: key}).first)
end

Upvotes: 1

Views: 1640

Answers (2)

Hotloo Xiranood

Reputation: 293

I think you should use MongoDB's ensureIndex() to remove the duplicates. For instance, if you want to drop the duplicate documents for the key duplicate_key, you can do

db.duplicate_collection.ensureIndex({'duplicate_key' : 1},{unique: true, dropDups: true})

where duplicate_collection is the collection that contains your duplicate documents. This operation keeps a single document for each value of the key and drops the rest. Note that the dropDups option was removed in MongoDB 3.0, so this only works on older servers.

After the operation, if you want to remove the index, just run the dropIndex operation. See the MongoDB documentation for details.

Upvotes: 2

Musicalmindz

Reputation: 96

A lot of solutions suggest Map Reduce (which is fast and fine), but I implemented a solution in Ruby that seems pretty fast as well and makes it easy to keep one document from each duplicate set.

Basically, you iterate over the collection and record each key value in a hash. Any time you find a key value that is already in the hash, you add the _id of that document to an array, which you use in a bulk removal at the end.

all_keys = {}
dupes = []
dupe_key = "some_key"

$mongodb.collection("some_collection").find.each do |doc|
   if all_keys.key?(doc[dupe_key])
      dupes << doc["_id"]          # key seen before: mark this document for removal
   else
      all_keys[doc[dupe_key]] = 1  # first occurrence: keep this document
   end
end

$mongodb.collection("some_collection").remove({_id: {"$in" => dupes } })
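
The duplicate detection above can also be factored into a small helper that accepts any enumerable of documents, which makes it easy to try out on plain hashes without a database. This is only a sketch: the name duplicate_ids, the sample documents, and the batch size of 1000 in the commented usage are assumptions for illustration, not part of the original solution.

```ruby
require "set"

# Returns the _id of every document whose value for `key` was already seen,
# i.e. everything except the first document in each duplicate set.
# `docs` is any Enumerable of hash-like documents (a cursor or a plain Array).
def duplicate_ids(docs, key)
  seen = Set.new
  docs.each_with_object([]) do |doc, dupes|
    # Set#add? returns nil when the value is already in the set.
    dupes << doc["_id"] unless seen.add?(doc[key])
  end
end

# Sample data standing in for the collection (hypothetical).
docs = [
  { "_id" => 1, "some_key" => "a" },
  { "_id" => 2, "some_key" => "b" },
  { "_id" => 3, "some_key" => "a" },
  { "_id" => 4, "some_key" => "a" }
]

dupes = duplicate_ids(docs, "some_key")  # => [3, 4]

# Hypothetical usage with the collection from this answer, removing in
# batches so that no single $in list grows too large:
# dupes.each_slice(1000) do |batch|
#   $mongodb.collection("some_collection").remove({ _id: { "$in" => batch } })
# end
```

Keeping the first document and collecting the rest handles any number of duplicates per key, not just one.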

The only issue with this method is that it won't work if the full set of keys and duplicate ids can't be held in memory. The map reduce solution would probably be best at that point.
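
A different way to limit memory use, which I have not benchmarked, is to read the documents sorted by the duplicate key. Duplicates are then adjacent, so only the previous key value and the current batch of ids have to be kept in memory. The sketch below assumes the input is already sorted on the key (how to request a sorted cursor depends on the driver version); the name each_duplicate_batch and the batch sizes are hypothetical.

```ruby
# Yields batches of duplicate _ids from documents that arrive sorted by `key`.
# Only the previous key value and the current batch are held in memory.
def each_duplicate_batch(sorted_docs, key, batch_size = 1000)
  return enum_for(:each_duplicate_batch, sorted_docs, key, batch_size) unless block_given?

  batch = []
  first = true
  previous = nil
  sorted_docs.each do |doc|
    # A document is a duplicate when its key equals the previous document's key.
    if !first && doc[key] == previous
      batch << doc["_id"]
      if batch.size >= batch_size
        yield batch
        batch = []
      end
    end
    previous = doc[key]
    first = false
  end
  yield batch unless batch.empty?
end

# Sample data standing in for a cursor sorted on "some_key" (hypothetical).
sorted = [
  { "_id" => 1, "some_key" => "a" },
  { "_id" => 3, "some_key" => "a" },
  { "_id" => 4, "some_key" => "a" },
  { "_id" => 2, "some_key" => "b" }
]

batches = each_duplicate_batch(sorted, "some_key", 2).to_a  # => [[3, 4]]

# Hypothetical usage: each yielded batch could be passed to
# remove({ _id: { "$in" => batch } }) on the collection.
```

This trades the in-memory hash for a sort on the server, which is probably only practical if the key is indexed.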

Upvotes: 0
