Hitesh Goel

Reputation: 15

Best way to migrate GBs of data from MongoDB to Cassandra using a Node.js script

I have a big collection in MongoDB and want to migrate all of its data to Cassandra, running some business logic in a Node.js script on each document along the way. What is the best way to do this?

I have made a script in which I fetch 5000 documents from Mongo in a single request, process the data and insert the documents into Cassandra. It takes a lot of time after 40-50 iterations and CPU usage shows 100%. Is this because of a lot of callbacks happening? I am new to Node.js, so I am not able to conclude anything.

var cassandra = require('../models/tracking_cassandra');
var TrackingEvents = require('../models/tracking_mongo_events');

var counter = 0;
var incr = 5000;
var final_counter = 0;
var total_documents = 0; // total events read from Mongo
var total_nuid = 0;      // events skipped because they have no uid

var start_point = function (callback){
    TrackingEvents.count(function(err, data){
        final_counter = data;
        // fetch the first document to get the starting _id
        TrackingEvents.getEventsByCounter(counter, function(err, obj) {
            var prevId = obj[0].toObject()._id;
            getMessagesFromMongo(prevId, callback);
        });
    });
};

function getMessagesFromMongo(prevId, callback){
    counter = counter + incr;
    TrackingEvents.getEventsByCounter(counter, function(err, obj) {
        var nextId = obj[0].toObject()._id;
        TrackingEvents.getEventsBtwIds(prevId, nextId, function (err, userEvents) {
            if(userEvents.length !== 0){
                insert_into_cassandra( userEvents, callback );
            }else{
                console.log('empty data set');
            }
        });
        // note: the recursion continues without waiting for the inserts above
        if(counter >= final_counter){
            callback();
        }else{
            getMessagesFromMongo(nextId, callback);
        }
    });
}

var insert_into_cassandra = function( events, callback ){
    total_documents = total_documents + events.length;
    for(var i = 0; i < events.length; i++){
        var userEventData = events[i].toObject();
        if(typeof userEventData.uid == 'undefined'){
            total_nuid++; // skip events without a uid
        }else{
            create_cassandra_query( userEventData );
        }
    }
};

var create_cassandra_query = function ( eventData ) {
    // drop fields that should not be written to Cassandra
    delete eventData._id;
    delete eventData[0];
    delete eventData.appid;
    delete eventData.appversion;
    var keys = "(";
    var values = "(";
    for(var key in eventData){
        if(eventData[key] == null){ // == null also matches undefined
            delete eventData[key];
            continue; // don't re-add the key as the string "undefined" below
        }
        keys = keys + key + ', ';
        values = values + ':' + key + ', ';
        if(key != 'uid' && key != 'date_time' && key != 'etypeId'){
            eventData[key] = String(eventData[key]);
        }
    }
    // trim the trailing ", " and close the parentheses
    keys = keys.slice(0, -2) + ")";
    values = values.slice(0, -2) + ")";
    var query = "INSERT INTO userwise_events " + keys + " VALUES " + values;
    cassandra.trackingCassandraClient.execute(query, eventData, { prepare: true }, function (err, data) {
        if(err){
            console.log(err);
        }
    });
};

var start_time = new Date();
start_point(function(){
    var end_time = new Date();
    var diff = end_time.getTime() - start_time.getTime();
    var seconds_diff = diff / 1000;
    var totalSec = Math.abs(seconds_diff);
    console.log('Total Execution Time : ' + totalSec);
});

process.on('uncaughtException', function (err) {
    console.log('Caught exception: ' + err);
});

Upvotes: 1

Views: 402

Answers (1)

rsp

Reputation: 111506

Is this because of a lot of callbacks happening?

There may be no callbacks at all for all I know; it's impossible to say what the problem is with code of which you didn't include even a single line.

For such a vague question I can only give you general advice: make sure you don't have long-running for or while loops, and never use a blocking system call anywhere other than on the first tick of the event loop. If you don't know what the first tick of the event loop is, then don't use blocking calls at all. Whenever you can, use streams for your data, especially if you have lots of it.
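As an illustration, a stream-based version of such a migration could look roughly like the sketch below. This is only a sketch built on some assumptions: it assumes TrackingEvents is a Mongoose model (so a query exposes .stream(), available in Mongoose 4.x), it reuses the question's Cassandra client, and insertEvent is a hypothetical stand-in for the business logic that builds the real INSERT.

var cassandra = require('../models/tracking_cassandra');
var TrackingEvents = require('../models/tracking_mongo_events');

// hypothetical helper: run one INSERT and report completion via the callback
function insertEvent(event, done) {
    var query = 'INSERT INTO userwise_events (uid, date_time, etypeId) ' +
                'VALUES (:uid, :date_time, :etypeId)';
    cassandra.trackingCassandraClient.execute(query, event, { prepare: true }, done);
}

var stream = TrackingEvents.find().stream(); // reads one document at a time

stream.on('data', function (doc) {
    stream.pause(); // backpressure: stop reading while the insert is in flight
    var event = doc.toObject();
    if (typeof event.uid === 'undefined') {
        return stream.resume(); // skip documents without a uid
    }
    insertEvent(event, function (err) {
        if (err) console.log(err);
        stream.resume(); // pull the next document only after the write finished
    });
});

stream.on('error', function (err) {
    console.log(err);
});

stream.on('close', function () {
    console.log('migration finished');
});

Because the stream is paused while each insert is in flight, only one document is buffered at a time, so memory stays flat and the event loop stays responsive no matter how large the collection is.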

100% CPU utilization is a bad sign and should never happen for an I/O-heavy operation like the one you are trying to perform. You should easily be able to handle insane amounts of data, especially when you use streams. Having your process max out the CPU for an inherently I/O-bound operation like moving large amounts of data through a network is a sure sign that you're doing something wrong in your code. What exactly is that? That will remain a mystery, since you didn't show us even a single line of your code.
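If you would rather keep a batch approach than switch to streams, the usual fix is to cap how many inserts are in flight at once instead of firing a whole batch in parallel. A minimal sketch with a plain counter, reusing the hypothetical insertEvent helper from the sketch above:

function insertAllLimited(events, limit, done) {
    var inFlight = 0, index = 0, finished = 0;
    if (events.length === 0) return done();

    function next() {
        // launch inserts until the cap is reached or the batch is exhausted
        while (inFlight < limit && index < events.length) {
            inFlight++;
            insertEvent(events[index++].toObject(), function (err) {
                if (err) console.log(err);
                inFlight--;
                finished++;
                if (finished === events.length) return done();
                next(); // a slot freed up, launch the next insert
            });
        }
    }
    next();
}

Fetching the next batch only from done() also means the code no longer has to guess, via a shared counter, when the previous batch has actually finished writing.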

Upvotes: 2

Related Questions