Reputation: 430
I have users' collection containing many lists of sub documents. Schema is something like this:
{
_id: ObjectId(),
name: aaa,
age: 20,
transactions:[
{
trans_id: 1,
product: mobile,
price: 30,
},
{
trans_id: 2,
product: tv,
price: 10
},
...]
...
}
So I have one doubt. trans_id
in transactions
list is unique over all the products, but it may be possible that I may have copied the same transaction again with same trans_id
(due to bad ETL programming). Now I want to drop those duplicate sub documents. I have indexed trans_id thought not unique
. I read about dropDups
option. But will it delete a particular duplicate exists in DB or it'll drop whole document (which I definitely don't want). If not how to do it?
PS: I am using MongoDB 2.6.6 version.
Upvotes: 0
Views: 544
Reputation: 151122
Nearest case to all we can see presented here it that now you need a way of defining the "distinct" items within the array where some items are in fact an "exact copy" of other items in the array.
The best case is to use $addToSet
along with the $each
modifier within a looping operation for the collection. Ideally you use the Bulk Operations API to take advantage of the reduced traffic when doing so:
var bulk = db.collection.initializeOrderedBulkOperation();
var count = 0;
// Read the docs
db.collection.find({}).forEach(function(doc) {
// Blank the array
bulk.find({ "_id": doc.id })
.updateOne({ "$set": { "transactions": [] } });
// Resend as a "set"
bulk.find({ "_id": doc.id })
.updateOne({
"$addToSet": {
"trasactions": { "$each": doc.transactions }
}
});
count++;
// Execute once every 500 statements ( actually 1000 )
if ( count % 500 == 0 ) {
bulk.execute()
bulk = db.collection.initializeOrderedBulkOperation();
}
});
// If a remainder then execute the remaining stack
if ( count % 500 != 0 )
bulk.execute();
So as long as the "duplicate" content is "entirely the same" then this approach will work. If the only thing that is actually "duplicated" is the "trans_id" field then you need an entirely different approach, since none of the "whole documents" are "duplicated" and this means you need more logic in place to do this.
Upvotes: 2