Michael

Reputation: 2566

How to do a massive random update with MongoDB / NodeJS

I have a MongoDB collection with more than 1,000,000 documents and I would like to update each document, one by one, with dedicated information (each doc has information coming from another collection).

Currently I'm using a cursor that fetches all the data from the collection, and then I update each record through the async module of Node.js.

Fetch all docs:

inst.db.collection(association.collection, function(err, collection) {
    collection.find({}, {}, function(err, cursor) {
        // toArray pulls the entire result set into memory at once
        cursor.toArray(function(err, items){
                 ......
        });
    });
});

Update each doc:

items.forEach(function(item) {
    // *** do some stuff with item, add field etc.
    tasks.push(function(nextTask) {
       inst.db.collection(association.collection, function(err, collection) {
           if (err) return nextTask(err);
           // save() replaces the whole document on the server
           collection.save(item, nextTask);
       });
    });
});

call the "save" task in parallel

async.parallel(tasks, function(err, results) {
    callback(err, results);
});

How would you do this type of operation in a more efficient way? I mean, how can I avoid the initial "find" that loads everything through a cursor? Is there a way to do the operation doc by doc, knowing that all docs should be updated?

Thanks for your support.

Upvotes: 2

Views: 1129

Answers (1)

mjhm

Reputation: 16705

Your question inspired me to create a Gist to do some performance testing of different approaches to your problem.

Here are the results running on a small EC2 instance with MongoDB at localhost. The test scenario is to uniquely operate on every document of a 100,000-element collection.

  1. 108.661 seconds -- Uses find().toArray to pull in all the items at once then replaces the documents with individual "save" calls.
  2. 99.645 seconds -- Uses find().toArray to pull in all the items at once then updates the documents with individual "update" calls.
  3. 74.553 seconds -- Iterates on the cursor (find().each) with batchSize = 10, then uses individual update calls.
  4. 58.673 seconds -- Iterates on the cursor (find().each) with batchSize = 10000, then uses individual update calls.
  5. 4.727 seconds -- Iterates on the cursor with batchSize = 10000, and does inserts into a new collection 10000 items at a time.

Though not included, I also did a test with MapReduce used as a server-side filter, which ran in about 19 seconds. I would have liked to similarly use "aggregate" as a server-side filter, but it doesn't yet have an option to output to a collection.
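
For reference, here is roughly what that MapReduce variant looks like (a sketch only, not the exact Gist code; the output collection name "items_mr" and the added field are placeholders). Since every _id is unique the reduce function never actually runs, and the output docs come back wrapped as { _id: ..., value: ... }:

// Sketch: let the server transform every doc and write the results
// straight into another collection via mapReduce's "out" option.
var map = function() {
    this.newField = "some value";   // do some stuff with the doc
    emit(this._id, this);
};
var reduce = function(key, values) {
    return values[0];               // one doc per key, nothing to merge
};

inst.db.collection(association.collection, function(err, collection) {
    if (err) return callback(err);
    collection.mapReduce(map, reduce, { out: { replace: "items_mr" } }, function(err, outCollection) {
        callback(err, outCollection);
    });
});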

The bottom line answer is that if you can get away with it, the fastest option is to pull items from an initial collection via a cursor, update them locally and insert them into a new collection in big chunks. Then you can swap in the new collection for the old.
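
Here is a minimal sketch of that approach, reusing the inst.db / association.collection / callback names from the question (the "_new" suffix, the 10000 chunk size, and the newField transform are placeholders, and for brevity it doesn't wait on in-flight chunk inserts):

inst.db.collection(association.collection, function(err, collection) {
    if (err) return callback(err);
    inst.db.collection(association.collection + "_new", function(err, newCollection) {
        if (err) return callback(err);

        var buffer = [];

        function flush(done) {
            if (buffer.length === 0) return done();
            var chunk = buffer;
            buffer = [];
            newCollection.insert(chunk, { w: 1 }, done);
        }

        // batchSize only controls how many docs each getMore fetches;
        // the driver still hands them to us one at a time.
        collection.find({}, { batchSize: 10000 }).each(function(err, item) {
            if (err) return callback(err);
            if (item === null) {
                // cursor exhausted: flush the tail, then swap the new
                // collection in for the old one (rename / drop).
                return flush(callback);
            }
            item.newField = "some value";   // *** do some stuff with item
            buffer.push(item);
            if (buffer.length === 10000) {
                flush(function(err) {
                    if (err) return callback(err);
                });
            }
        });
    });
});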

If you need to keep the database active, then the best option is to use a cursor with a big batchSize and update the documents in place. The "save" call is slower than "update" because it needs to replace the whole document, and probably needs to reindex it as well.
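
A sketch of that in-place variant: iterate the cursor with a big batchSize and issue an update with $set on just the field that changed, instead of saving the whole document (again, newField and the simple concurrency bookkeeping here are illustrative, not the actual Gist code):

inst.db.collection(association.collection, function(err, collection) {
    if (err) return callback(err);

    var outstanding = 0, cursorDone = false;

    collection.find({}, { batchSize: 10000 }).each(function(err, item) {
        if (err) return callback(err);
        if (item === null) {
            cursorDone = true;
            if (outstanding === 0) callback(null);
            return;
        }
        outstanding++;
        // $set touches only the new field, so the server doesn't have to
        // rewrite (and reindex) the entire document.
        collection.update({ _id: item._id }, { $set: { newField: "some value" } }, { w: 1 }, function(err) {
            outstanding--;
            if (err) return callback(err);
            if (cursorDone && outstanding === 0) callback(null);
        });
    });
});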

Upvotes: 1
