if __name__ is None

Reputation: 11533

MongoDB: Conditionally drop duplicates

I have a documents collection like so:

{
    "word": "foo",
    "likes": 10,
    "dislikes": 1,
},
{
    "word": "foo",
    "likes": 5,
    "dislikes": 9,
},

The trouble is, my collection is riddled with similar documents (sharing the same word, but different data). I would like to remove these similar, almost duplicate entries.

Now, an easy way would be to use a unique index:

db.entries.ensureIndex({'word' : 1}, {unique : true, dropDups : true})

But I feel like I can do better. Maybe I can use likes/dislikes data to calculate the ratio and keep only the best entries, while removing the rest.

I was wondering whether this is possible with MapReduce and some Mongo CLI JavaScript magic, or whether I should solve this problem programmatically using MongoDB primitives.

Edit: This cleanup is a one-time event, and performance doesn't matter.

Upvotes: 1

Views: 128

Answers (1)

ma08

Reputation: 3734

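var docs = db.entries.aggregate([
    // Collect every entry for a word, with its likes/dislikes ratio as the score.
    {$group: {
        _id: '$word',
        entries: {$push: {
            score: {$divide: ['$$ROOT.likes', '$$ROOT.dislikes']},
            _id: '$$ROOT._id'
        }}
    }},
    // Flatten, sort by score (best first), then rebuild one sorted list per word.
    {$unwind: '$entries'},
    {$sort: {'entries.score': -1}},
    {$group: {_id: '$_id', entries: {$push: '$$ROOT.entries'}}}
]);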

Remember to handle the case where dislikes is 0, since $divide cannot divide by zero; maybe you can use dislikes + 1 as the divisor. I don't know how the output is captured in the JavaScript CLI, so I assume that docs holds the output of the aggregation.
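As a rough illustration of the zero-dislikes adjustment, the pipeline could be written with $add in the divisor. This is only a sketch: the variable name smoothedPipeline is made up for this example, and dividing by dislikes + 1 is one possible smoothing choice that changes the ratio slightly for every entry.

```javascript
// Sketch only: same four stages as the pipeline in this answer, but the
// divisor is (dislikes + 1), so it can never be zero.
// `smoothedPipeline` is a hypothetical name used for this example.
const smoothedPipeline = [
  {
    $group: {
      _id: "$word",
      entries: {
        $push: {
          // score = likes / (dislikes + 1)
          score: { $divide: ["$likes", { $add: ["$dislikes", 1] }] },
          _id: "$_id",
        },
      },
    },
  },
  { $unwind: "$entries" },
  { $sort: { "entries.score": -1 } }, // best score first
  { $group: { _id: "$_id", entries: { $push: "$entries" } } },
];

// In the mongo shell this would be passed as:
//   var docs = db.entries.aggregate(smoothedPipeline);
```
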

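var duplicate_ids = [];
docs.forEach(function (doc) {
    // entries[0] has the best score for this word; everything after it is a duplicate.
    for (var i = 1; i < doc.entries.length; i++) {
        duplicate_ids.push(doc.entries[i]._id);
    }
});
db.entries.remove({_id: {'$in': duplicate_ids}});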

This should solve your problem.
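Since the removal is destructive, it may be worth checking the selection logic on a few documents first. The following is a plain JavaScript dry run with no database calls; the function name findDuplicateIds, the sample _id values, and the dislikes + 1 divisor are all assumptions made for this sketch.

```javascript
// Dry run of the "keep the best entry per word" logic on plain objects.
// Assumption: score = likes / (dislikes + 1), to avoid dividing by zero.
function findDuplicateIds(entries) {
  const best = new Map(); // word -> { _id, score } of the best entry so far
  const duplicateIds = [];
  for (const entry of entries) {
    const score = entry.likes / (entry.dislikes + 1);
    const current = best.get(entry.word);
    if (current === undefined) {
      // First entry seen for this word.
      best.set(entry.word, { _id: entry._id, score });
    } else if (score > current.score) {
      // New entry wins; the previous best becomes a duplicate.
      duplicateIds.push(current._id);
      best.set(entry.word, { _id: entry._id, score });
    } else {
      // Existing best stays (ties keep the earlier entry).
      duplicateIds.push(entry._id);
    }
  }
  return duplicateIds;
}

// Sample data modelled on the question; the _id values are invented.
const sample = [
  { _id: 1, word: "foo", likes: 10, dislikes: 1 },
  { _id: 2, word: "foo", likes: 5, dislikes: 9 },
  { _id: 3, word: "bar", likes: 2, dislikes: 0 },
];

console.log(findDuplicateIds(sample)); // logs [ 2 ]
```

If the ids reported here match what you expect, the same list can be compared against duplicate_ids before running the remove.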

Upvotes: 3
