blackmamba

Reputation: 1982

Map reduce to delete duplicates (mongodb)

I have created a map-reduce function that returns each document's key along with the number of times it occurs. Now I need to remove all the duplicates. How should I do it?

 res = col.map_reduce(map, reduce, "my_results")

This gives output like:

{u'_id': u'http://www.hardassetsinvestor.com/features/5485-soft-commodity-q4-report-low-inventories-buoy-cocoa-growing-stocks-weigh-on-coffee-cotton-a-sugar.html', u'value': 2.0}
{u'_id': u'http://www.hardassetsinvestor.com/market-monitor-archive/5490-week-in-review-gold-a-silver-kick-off-2014-strongly-oil-a-natgas-stall.html', u'value': 2.0}

Upvotes: 1

Views: 812

Answers (2)

Stennie

Reputation: 65303

Assuming you don't care which duplicate gets removed, an easy approach is to create a unique index with dropDups: true. Note that the dropDups option was removed in MongoDB 3.0, so this only works on older versions.

For example, assuming a field name of url:

db.collection.ensureIndex( { url: 1 }, { unique: true, dropDups: true } )

Important note from the dropDups documentation:

As in all unique indexes, if a document does not have the indexed field, MongoDB will include it in the index with a “null” value. If subsequent fields do not have the indexed field, and you have set {dropDups: true}, MongoDB will remove these documents from the collection when creating the index. If you combine dropDups with the sparse option, this index will only include documents in the index that have the value, and the documents without the field will remain in the database.
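To make the note above concrete, here is a small pure-Python simulation of which documents would survive. It only illustrates the documented behaviour and is not how MongoDB implements it. The function name `surviving_docs` and the sample documents are made up for this sketch, and keeping the first document seen for each key is a choice of the sketch, since the answer assumes you don't care which duplicate is removed.

```python
def surviving_docs(docs, field, sparse=False):
    """Simulate which documents a unique index with dropDups would keep.

    Illustration only: documents are plain dicts, and the first document
    seen for each key value is the one kept.
    """
    seen = set()
    kept = []
    for doc in docs:
        if field not in doc:
            if sparse:
                # Sparse index: documents without the field are not indexed,
                # so they all stay in the collection.
                kept.append(doc)
                continue
            key = None  # non-sparse: a missing field is indexed as null
        else:
            key = doc[field]
        if key in seen:
            continue  # duplicate key: dropDups would remove this document
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    {"_id": 1, "url": "a"},
    {"_id": 2, "url": "a"},
    {"_id": 3},
    {"_id": 4},
]
print([d["_id"] for d in surviving_docs(docs, "url")])               # [1, 3]
print([d["_id"] for d in surviving_docs(docs, "url", sparse=True)])  # [1, 3, 4]
```

Without `sparse`, the two documents that lack `url` both count as null, so only one of them is kept. With `sparse`, both remain.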

Upvotes: 2

Sammaye

Reputation: 43884

You would write a small script to do this, e.g. in the shell:

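db.my_results.find().forEach(function(doc){
    // doc._id is the map-reduce key; "url" is the assumed name of the duplicated field
    // run value - 1 times so that exactly one copy is left behind
    for(var i = 1; i < doc.value; i++)
        db.realCollection.remove({url: doc._id}, true);
});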

The trailing true (the justOne flag) makes each remove call delete only one matching document.

Edit

Adding Python since the above code is hard to translate:

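for doc in db.my_results.find():
    # doc['_id'] is the map-reduce key; 'url' is the assumed name of the duplicated field
    # delete value - 1 copies so that exactly one is left behind
    for i in range(int(doc['value']) - 1):
        # delete_one needs PyMongo 3+; it removes a single matching document
        db.realCollection.delete_one({'url': doc['_id']})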
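If it matters which copy is kept, one option is to decide that in client code before deleting anything. The sketch below rests on assumptions: `ids_to_delete` is a hypothetical helper, `url` is the assumed name of the duplicated field, and the plain dicts stand in for what `db.realCollection.find()` would return.

```python
def ids_to_delete(docs, field="url"):
    """Return the _id of every document whose `field` value was already seen.

    The first document seen for each value is kept. Documents that lack
    the field are ignored.
    """
    seen = set()
    doomed = []
    for doc in docs:
        if field not in doc:
            continue
        value = doc[field]
        if value in seen:
            doomed.append(doc["_id"])
        else:
            seen.add(value)
    return doomed


sample = [
    {"_id": 1, "url": "http://example.com/a"},
    {"_id": 2, "url": "http://example.com/b"},
    {"_id": 3, "url": "http://example.com/a"},
    {"_id": 4, "url": "http://example.com/a"},
]
print(ids_to_delete(sample))  # [3, 4]
```

With PyMongo 3 or later, the returned list could then be passed to something like `db.realCollection.delete_many({'_id': {'$in': ids}})`. Sort the cursor first if a particular copy should win.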

Upvotes: 0
