harshad

Reputation: 430

How to drop duplicate embedded documents

I have a users collection containing many lists of sub documents. The schema is something like this:

    {
        _id: ObjectId(),
        name: "aaa",
        age: 20,
        transactions: [
            {
                trans_id: 1,
                product: "mobile",
                price: 30
            },
            {
                trans_id: 2,
                product: "tv",
                price: 10
            },
            ...
        ],
        ...
    }

So I have one doubt. trans_id in the transactions list is unique across all products, but it is possible that I copied the same transaction again with the same trans_id (due to bad ETL programming). Now I want to drop those duplicate sub documents. I have indexed trans_id, though not as unique. I read about the dropDups option, but will it delete a particular duplicate that exists in the DB, or will it drop the whole document (which I definitely don't want)? If not, how do I do it?

PS: I am using MongoDB 2.6.6 version.

Upvotes: 0

Views: 544

Answers (1)

Neil Lunn

Reputation: 151122

The nearest case to what we can see presented here is that you now need a way of defining the "distinct" items within the array, where some items are in fact an "exact copy" of other items in the array.

The best case is to use $addToSet along with the $each modifier within a looping operation for the collection. Ideally you use the Bulk Operations API to take advantage of the reduced traffic when doing so:

var bulk = db.collection.initializeOrderedBulkOperation();
var count = 0;

// Read the docs
db.collection.find({}).forEach(function(doc) {
    // Blank the array
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": [] } });
    // Resend as a "set"
    bulk.find({ "_id": doc._id })
        .updateOne({
            "$addToSet": {
                "transactions": { "$each": doc.transactions }
            }
        });
    count++;

    // Execute once every 500 documents ( 1000 statements, since each document queues two )
    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOperation();
    }
});

// If a remainder then execute the remaining stack
if ( count % 500 != 0 )
    bulk.execute();

So as long as the "duplicate" content is "entirely the same", this approach will work. If the only thing that is actually "duplicated" is the "trans_id" field, then you need an entirely different approach, since none of the "whole documents" are "duplicated", and that means you need more logic in place to decide which copy of each transaction to keep.
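For that case, a minimal sketch (not part of the original answer, and assuming you simply want to keep the first occurrence of each trans_id) would be to filter the transactions array in the same kind of loop and write the filtered array back with $set:

var bulk = db.collection.initializeOrderedBulkOperation();
var count = 0;

db.collection.find({}).forEach(function(doc) {
    // Keep only the first transaction seen for each trans_id
    var seen = {};
    var deduped = doc.transactions.filter(function(t) {
        if ( seen.hasOwnProperty(t.trans_id) )
            return false;
        seen[t.trans_id] = true;
        return true;
    });

    // Write the filtered array back in a single statement
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": deduped } });
    count++;

    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOperation();
    }
});

if ( count % 1000 != 0 )
    bulk.execute();

Which copy to keep (first, last, or merged) is up to your own rules; the filter callback is where that logic would go.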

Upvotes: 2
