Shih-Min Lee

Reputation: 9700

Node insert large data using mongoose

I am trying to insert a large data set into MongoDB with Mongoose, but first I need to make sure my for loop is working correctly.

// basic schema settings
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var TempcolSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc:[]
});
TempcolSchema.index({
  'loc': "2dsphere"
});


// we can easily see from the output that the for loop runs correctly
mongoose.connect('mongodb://localhost/mean-dev', function(err){
    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
    }
});

The output is 0, 1, 2, 3, ... etc., so the loop itself runs correctly.

Now I want to add a Mongoose save statement inside the for loop.

mongoose.connect('mongodb://localhost/mean-dev', function(err){
    var Tempcol = mongoose.model('Tempcol', TempcolSchema);

    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
        var lon = parseInt(c/100000);
        var lat = c%100000;
        new Tempcol({cid: Math.random(), loc: [lon, lat]}).save(function(err){});
    }
});

The output is still 0, 1, 2, 3, ... However, the for loop stops after a while with an error saying the maximum call stack size has been exceeded, and there seems to be some kind of memory problem. Also, when I check the collection I see that no data points were inserted at all.

So does anyone know what might be happening? Thanks.

Upvotes: 1

Views: 2182

Answers (1)

Neil Lunn

Reputation: 151092

The problem here is that the loop you are running does not wait for each operation to complete. So in fact you are just queuing up thousands of .save() requests and trying to run them all concurrently. You can't do that within any reasonable limit, hence you get the error response.
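To make that failure mode concrete, here is a minimal sketch in plain Node (no database needed). The hypothetical `fakeSave` stands in for Mongoose's `document.save()`; the names here are illustrative, not a real API. It shows that a plain for loop starts every request before a single one has completed:

```javascript
// Sketch: why the plain for loop blows up. Every save is started
// immediately, so pending work grows with the loop size instead of
// staying bounded. `fakeSave` stands in for document.save().
var pending = 0, maxPending = 0;

function fakeSave(doc, cb) {
  pending++;
  if (pending > maxPending) maxPending = pending;
  setImmediate(function () { pending--; cb(null); }); // pretend I/O
}

for (var i = 0; i < 5000; i++) {
  fakeSave({ cid: i }, function () {});
}

console.log(maxPending); // 5000 — every request queued before any completed
```

With 10,000,000 iterations, that pending count is what exhausts memory and the stack long before the writes ever reach the database.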

The async module has various methods for iterating while waiting on a callback for each iteration, where probably the simplest direct form is whilst. Mongoose also handles the connection management for you, so you do not need to embed everything within the connect callback, as the models are connection aware:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

var i = 0;
async.whilst(
    function() { return i < 10000000; },
    function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = Math.floor(c/100000);
        var lat = c%100000;
        new TempCol({cid: Math.random(), loc: [lon, lat]}).save(function(err){
            // only move to the next iteration once this save completes
            callback(err);
        });
    },
    function(err) {
       // When the loop is complete or on error
    }
);

Not the most fantastic way to do it, as it is still one write at a time, and you could use other methods to "govern" the number of concurrent operations, but this at least will not blow up the call stack.
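As a sketch of what "governing" the concurrent operations could look like without the async module, here is a minimal hand-rolled limiter in plain Node. `fakeSave` and `runLimited` are hypothetical names standing in for `document.save()` and the limiting logic; this is an illustration of the pattern, not a drop-in implementation:

```javascript
// Sketch: cap the number of in-flight saves at `limit` instead of
// starting them all at once. `fakeSave` stands in for document.save().
function fakeSave(doc, cb) {
  setImmediate(function () { cb(null, doc); }); // pretend I/O
}

function runLimited(total, limit, done) {
  var next = 0, inFlight = 0, saved = 0, maxInFlight = 0;

  function launch() {
    // keep at most `limit` saves running at any one time
    while (inFlight < limit && next < total) {
      next++;
      inFlight++;
      if (inFlight > maxInFlight) maxInFlight = inFlight;
      fakeSave({ cid: next }, function (err) {
        inFlight--;
        saved++;
        if (saved === total) {
          return done(null, { saved: saved, maxInFlight: maxInFlight });
        }
        launch(); // a slot freed up, refill the window
      });
    }
  }
  launch();
}

runLimited(25, 5, function (err, stats) {
  console.log(stats); // → { saved: 25, maxInFlight: 5 }
  global.stats = stats;
});
```

The async module's `eachLimit`/`queue` helpers give you the same window behaviour without writing the bookkeeping yourself.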

From MongoDB 2.6 and greater you can make use of the Bulk Operations API in order to process more than one write at a time on the server. The process is similar, but this time you can send 1000 operations to the server in a single write with a single response, which is much faster:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

mongoose.connection.on("open",function(err,conn) {

    var i = 0;
    var bulk = TempCol.collection.initializeOrderedBulkOp();

    async.whilst(
      function() { return i < 10000000; },
      function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = Math.floor(c/100000);
        var lat = c%100000;

        bulk.insert({ "cid": Math.random(), "loc": [ lon, lat ] });

        if ( i % 1000 === 0 ) {
            // send the current batch and start a fresh one
            bulk.execute(function(err,result) {
                bulk = TempCol.collection.initializeOrderedBulkOp();
                callback(err);
            });
        } else {
            process.nextTick(callback);
        }
      },
      function(err) {
        // When the loop is complete or on error

        // Flush any remainder not plainly divisible by 1000
        if ( i % 1000 !== 0 ) {
            bulk.execute(function(err,result) {
                // possibly check for errors here
            });
        }
      }
    );

});

That is actually using native driver methods which are not yet exposed in Mongoose, so extra care is being taken to make sure the connection is available. This is one way of doing it, but not the only way; the main point is that Mongoose's connection "magic" is not built in here, so you must be sure the connection is established before the bulk operations run.

Here you happen to have a round multiple of 1000 items to process, but where that is not the case you also need to call bulk.execute() in the final block, as shown, to flush whatever the modulo check left behind in the last partial batch.
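The batching-and-remainder logic can be seen in isolation with a small sketch in plain Node, where a stand-in `flush` function takes the place of `bulk.execute()` (names here are illustrative, not driver API):

```javascript
// Sketch: accumulate documents, flush every 1000, then flush the
// remainder at the end. `flush`/`batches` stand in for bulk.execute()
// and the server round trips it would make.
var BATCH = 1000;
var batch = [];
var batches = [];            // record of flushed batch sizes

function flush(docs) {       // stands in for bulk.execute()
  batches.push(docs.length);
}

for (var i = 1; i <= 2500; i++) {
  batch.push({ cid: Math.random(), loc: [Math.floor(i / 100000), i % 100000] });
  if (i % BATCH === 0) {
    flush(batch);
    batch = [];              // like re-initializing the bulk op
  }
}
if (batch.length > 0) flush(batch);  // remainder not divisible by 1000

console.log(batches);        // [ 1000, 1000, 500 ]
```

Without the final flush, those last 500 documents would silently never reach the server.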

The main point is not to grow the queue of operations to an unreasonable size, and to keep the processing limited. The flow control here allows each operation, which takes some time, to actually complete before moving on to the next iteration. So either batched writes or some additional parallel queuing is what you want for best performance.

There is also the .initializeUnorderedBulkOp() form of this if you do not want write errors to be fatal and would rather handle them in a different way. See the official documentation on the Bulk API and its responses for how to interpret the result returned.

Upvotes: 2
