user1620696

Reputation: 11375

How to efficiently process a big list of data like that in Node.js?

I have quite a big list of data in a file and I need to process it in Node.js. The list consists of URLs, and the work done on each URL is essentially a request together with some processing of the response.

Since the code that does the work is quite big, I'll just refer to the function that starts it all as doWork(). It takes the data and a callback, so it looks something like:

function doWork(data, callback)

The way I'm currently doing it is as follows: I wrote a queueManager module like this:

var queueManager = {};
queueManager.queue = [];

queueManager.addForProcessing = function (data) {
    this.queue.push(data);
};

queueManager.processing = false;

queueManager.startProcessing = function () {
    if (!this.processing) {
        this.process();
        this.processing = true;
    }
};

queueManager.process = function () {
    var self = this;
    if (this.queue.length > 0) {
        doWork(this.queue.pop(), function () {
            self.process();
        });
    } else {
        this.processing = false;
    }
};

module.exports = queueManager;

And I use it together with readline:

rl.on('line', function (data) {
    queueManager.addForProcessing(data);
    queueManager.startProcessing();
});
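For reference, rl in the snippet above is assumed to come from the readline module reading the file as a stream; a minimal setup might look like this (data.txt is just a placeholder for the actual file name):

var fs = require('fs');
var readline = require('readline');

// Stream the file line by line instead of loading all 250K lines into memory at once.
var rl = readline.createInterface({
    input: fs.createReadStream('data.txt')
});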

This works, but it doesn't seem efficient. In effect it runs serially: one line is processed at a time, and while that line is being processed nothing else happens. Since each line involves a web request, processing a single line can take some time, and that holds everything else up. The file has more than 250K lines, so this quickly becomes a problem.

Now, why did I add this queue manager? Because if I just did:

rl.on('line', function (data) {
    doWork(data, function () {
        console.log(`${data} has been processed...`);
    });
});

the app just doesn't work. It starts processing the same data tons of times, and not a single item gets processed correctly.

My workaround works, but it causes efficiency problems.

So, given a big list of data that needs processing involving operations like web requests, how can I do it more efficiently than what I did?

Upvotes: 0

Views: 733

Answers (1)

Hugo Silva

Reputation: 6948

You should have a look at the cluster module and worker processes - https://nodejs.org/api/cluster.html

A single instance of Node.js runs in a single thread. To take advantage of multi-core systems the user will sometimes want to launch a cluster of Node.js processes to handle the load.

The cluster module allows you to easily create child processes that all share server ports.

You can basically split your application into two processes and send the big data processing to the background. Then you can use messages to show the queue status from your main app process.
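Here is a minimal sketch of that idea, assuming the main process forks a single background worker and the worker reports its progress back through messages (the processed field and the simulated interval are only illustrative, not part of your code):

var cluster = require('cluster');

if (cluster.isMaster) {
    // Main app process: fork a background worker and print its status messages.
    var worker = cluster.fork();

    worker.on('message', function (msg) {
        console.log(`${msg.processed} items processed so far`);
    });

    worker.on('exit', function () {
        console.log('Background processing finished');
    });
} else {
    // Background process: this is where the file would be read and doWork()
    // called for each line. The interval below only simulates progress reports.
    var processed = 0;
    var timer = setInterval(function () {
        processed += 1;
        process.send({ processed: processed });
        if (processed === 5) {
            clearInterval(timer);
            process.exit(0);
        }
    }, 500);
}

In the real app, the worker would send a message after each doWork() callback instead of on a timer, and the main process stays free to do other work while the list is churned through in the background.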

Here is a nice tutorial on cluster - https://www.sitepoint.com/how-to-create-a-node-js-cluster-for-speeding-up-your-apps/

Upvotes: 1
