harshad

Reputation: 430

How to drop duplicate embedded documents

I have a users collection containing many lists of sub documents. The schema is something like this:

    {
        _id: ObjectId(),
        name: "aaa",
        age: 20,
        transactions: [
            {
                trans_id: 1,
                product: "mobile",
                price: 30
            },
            {
                trans_id: 2,
                product: "tv",
                price: 10
            },
            ...
        ],
        ...
    }

So I have one doubt. trans_id in the transactions list is unique across all products, but it is possible that I copied the same transaction again with the same trans_id (due to bad ETL programming). Now I want to drop those duplicate sub documents. I have indexed trans_id, though not as unique. I read about the dropDups option, but will it delete a particular duplicate that exists in the DB, or will it drop the whole document (which I definitely don't want)? If not, how do I do it?

PS: I am using MongoDB 2.6.6 version.

Upvotes: 0

Views: 544

Answers (1)

Neil Lunn

Reputation: 151122

The nearest case to what we can see presented here is that you now need a way of defining the "distinct" items within the array, where some items are in fact an "exact copy" of other items in the array.

The best case is to use $addToSet along with the $each modifier within a looping operation for the collection. Ideally you use the Bulk Operations API to take advantage of the reduced traffic when doing so:

var bulk = db.collection.initializeOrderedBulkOperation();
var count = 0;

// Read the docs
db.collection.find({}).forEach(function(doc) {
    // Blank the array
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": [] } });
    // Resend as a "set"
    bulk.find({ "_id": doc._id })
        .updateOne({
            "$addToSet": {
                "transactions": { "$each": doc.transactions }
            }
        });
    count++;

    // Execute once every 500 documents ( 1000 statements, since each document queues two )
    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOperation();
    }
});

// If a remainder then execute the remaining stack
if ( count % 500 != 0 )
    bulk.execute();

So as long as the "duplicate" content is "entirely the same", this approach will work. If the only thing that is actually "duplicated" is the "trans_id" field, then you need an entirely different approach, since none of the "whole documents" are "duplicated", and that means you need more logic in place to decide which copy of each transaction to keep.
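For that case, a minimal sketch (not part of the original answer, and assuming you simply want to keep the first occurrence of each trans_id) would be to filter the transactions array in the same kind of loop and write the filtered array back with $set:

var bulk = db.collection.initializeOrderedBulkOperation();
var count = 0;

db.collection.find({}).forEach(function(doc) {
    // Keep only the first transaction seen for each trans_id
    var seen = {};
    var deduped = doc.transactions.filter(function(t) {
        if ( seen.hasOwnProperty(t.trans_id) )
            return false;
        seen[t.trans_id] = true;
        return true;
    });

    // Write the filtered array back in a single statement
    bulk.find({ "_id": doc._id })
        .updateOne({ "$set": { "transactions": deduped } });
    count++;

    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOperation();
    }
});

if ( count % 1000 != 0 )
    bulk.execute();

Which copy to keep (first, last, or merged) is up to your own rules; the filter callback is where that logic would go.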

Upvotes: 2
