Reputation: 11533
I have a documents collection like so:
{
"word": "foo",
"likes": 10,
"dislikes": 1,
},
{
"word": "foo",
"likes": 5,
"dislikes": 9,
},
The trouble is, my collection is riddled with similar documents (sharing the same word, but different data). I would like to remove these similar, almost duplicate entries.
Now, an easy way would be to use a unique index:
db.entries.ensureIndex({'word' : 1}, {unique : true, dropDups : true})
But I feel like I can do better. Maybe I can use likes/dislikes data to calculate the ratio and keep only the best entries, while removing the rest.
I was wondering whether this is possible with MapReduce and some Mongo CLI JavaScript magic, or whether I should solve this problem programmatically using MongoDB primitives.
Edit: This cleanup is a 1-time event, and performance doesn't matter.
Upvotes: 1
Views: 128
Reputation: 3734
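db.entries.aggregate([
  // Collect every document for a word, along with its likes/dislikes score
  {$group: {
    _id: '$word',
    entries: {$push: {
      score: {$divide: ['$$ROOT.likes', '$$ROOT.dislikes']},
      _id: '$$ROOT._id'
    }}
  }},
  // One document per entry, so the entries can be sorted by score
  {$unwind: '$entries'},
  {$sort: {'entries.score': -1}},
  // Regroup by word; entries arrive in sorted order, best score first
  {$group: {_id: '$_id', entries: {$push: '$$ROOT.entries'}}}
])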
Handle the case where dislikes is 0, because $divide raises an error on division by zero. Maybe you can divide by {$add: ['$$ROOT.dislikes', 1]} instead.
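A minimal sketch of the "+1" idea, in case it helps. The names scoreExpr and likeRatio are made up for illustration; only $divide and $add are real aggregation operators. The plain function mirrors the expression so you can check the numbers by hand before running the pipeline.

```javascript
// Aggregation expression: likes / (dislikes + 1), so a zero dislikes count
// never causes a division by zero. (scoreExpr is a hypothetical name.)
const scoreExpr = {
  $divide: ['$$ROOT.likes', { $add: ['$$ROOT.dislikes', 1] }]
};

// The same formula in plain JavaScript, for checking values by hand.
// (likeRatio is a hypothetical helper, not part of the MongoDB shell.)
function likeRatio(likes, dislikes) {
  return likes / (dislikes + 1);
}
```

If you go this way, scoreExpr would take the place of the $divide expression in the first $group stage. Note that the +1 changes the ranking slightly compared to a pure ratio, which is probably fine for a one-time cleanup.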
I'm not sure how the output is captured in the JavaScript CLI. I assume that docs
holds the result of the aggregation above.
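var duplicate_ids = [];
docs.forEach(function(doc){
  // entries[0] has the best score for this word; everything after it is a duplicate
  for(var i=1;i<doc.entries.length;i++){
    duplicate_ids.push(doc.entries[i]._id);
  }
});
// Remove every duplicate in a single call
db.entries.remove({_id:{'$in':duplicate_ids}})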
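Since the remove is destructive, it may be worth dry-running the selection rule on plain arrays first. The sketch below is only an illustration: findDuplicateIds, entryScore and sample are hypothetical names, and it assumes the likes / (dislikes + 1) score rather than the raw ratio.

```javascript
// Hypothetical score: likes / (dislikes + 1), which is safe when dislikes is 0.
function entryScore(e) {
  return e.likes / (e.dislikes + 1);
}

// Hypothetical helper: returns the _id of every entry that is NOT the
// best-scoring one for its word.
function findDuplicateIds(entries) {
  // Group the entries by word.
  const byWord = new Map();
  for (const e of entries) {
    if (!byWord.has(e.word)) byWord.set(e.word, []);
    byWord.get(e.word).push(e);
  }

  const duplicateIds = [];
  for (const group of byWord.values()) {
    // Sort a copy, highest score first, and keep only the first element.
    const sorted = group.slice().sort((a, b) => entryScore(b) - entryScore(a));
    for (let i = 1; i < sorted.length; i++) {
      duplicateIds.push(sorted[i]._id);
    }
  }
  return duplicateIds;
}

// Made-up sample data shaped like the documents in the question.
const sample = [
  { _id: 1, word: 'foo', likes: 10, dislikes: 1 },
  { _id: 2, word: 'foo', likes: 5, dislikes: 9 },
  { _id: 3, word: 'bar', likes: 2, dislikes: 0 }
];
const sampleDuplicates = findDuplicateIds(sample);
```

With this sample, only the second "foo" entry ends up in sampleDuplicates; the single "bar" entry is kept because it has no competitor.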
This should solve your problem.
Upvotes: 3