Reputation: 1982
I have created a map-reduce function to get all documents along with their counts. Now I need to remove all the duplicates. How should I do it?
res = col.map_reduce(map, reduce, "my_results")
Gives output like:
{u'_id': u'http://www.hardassetsinvestor.com/features/5485-soft-commodity-q4-report-low-inventories-buoy-cocoa-growing-stocks-weigh-on-coffee-cotton-a-sugar.html', u'value': 2.0}
{u'_id': u'http://www.hardassetsinvestor.com/market-monitor-archive/5490-week-in-review-gold-a-silver-kick-off-2014-strongly-oil-a-natgas-stall.html', u'value': 2.0}
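The map and reduce functions passed to map_reduce above are not shown in the question; a minimal sketch of what they might look like with pymongo, assuming each source document has a url field (the field name is an assumption):

from bson.code import Code

# Hypothetical map/reduce pair: emit each document's url with a count of 1,
# then sum the counts per url. The names shadow Python built-ins only to
# match the call above.
map = Code("function () { emit(this.url, 1); }")
reduce = Code("function (key, values) { return Array.sum(values); }")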
Upvotes: 1
Views: 812
Reputation: 65303
Assuming you don't care which duplicate gets removed, an easy approach is to ensure a unique index with dropDups: true.
For example, assuming a field name of url:
db.collection.ensureIndex( { url: 1 }, { unique: true, dropDups: true } )
Important note from the dropDups documentation:
As in all unique indexes, if a document does not have the indexed field, MongoDB will include it in the index with a “null” value. If subsequent fields do not have the indexed field, and you have set {dropDups: true}, MongoDB will remove these documents from the collection when creating the index. If you combine dropDups with the sparse option, this index will only include documents in the index that have the value, and the documents without the field will remain in the database.
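For illustration, combining dropDups with sparse as described in that note might look like this (a sketch extending the index example above; note that dropDups was removed in MongoDB 3.0, so this only applies to older servers):

// Same index as above, but sparse: documents missing the url field are
// left in the collection instead of being deleted during index creation.
db.collection.ensureIndex( { url: 1 }, { unique: true, dropDups: true, sparse: true } )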
Upvotes: 2
Reputation: 43884
You would write a small script to do this, e.g. in the shell:
db.my_results.find().forEach(function(doc){
    if (doc.value > 1)
        db.realCollection.remove({_id: doc._id}, true);
});
The trailing true makes remove delete only one matching document.
Adding a Python version, since the shell code above is hard to translate directly:
for doc in db.my_results.find():
    # Documents from find() are dicts, so use item access rather than
    # attribute access; value comes back as a float (e.g. 2.0).
    if doc['value'] > 1:
        for i in range(0, int(doc['value'])):
            # _id matches at most one document, so no justOne flag is needed
            db.realCollection.remove({'_id': doc['_id']})
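For completeness, the db handle used in the snippets above could be set up with pymongo along these lines (a sketch; the connection string and database name are placeholders):

from pymongo import MongoClient

# Hypothetical connection; adjust the URI and database name for your setup.
client = MongoClient("mongodb://localhost:27017")
db = client["mydatabase"]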
Upvotes: 0