Alexander Mills

Reputation: 100000

Highest-performance way to fork/spawn many Node.js processes from a parent process

I am using Node.js to spawn upwards of 100 child processes, maybe even 1000. What concerns me is that the parent process could become some sort of bottleneck if all the stdout/stderr of the child processes has to go through the parent process in order to get logged somewhere.

So my assumption is that in order to achieve highest performance/throughput, we should ignore stdout/stderr in the parent process, like so:

const cp = require('child_process');

items.forEach(function(exec){

   const n = cp.spawn('node', [exec], {
      stdio: ['ignore','ignore','ignore','ipc']
   });

});

My question is, how much of a performance penalty is it to use pipe in this manner:

// (100+ items to iterate over)

items.forEach(function(exec){

   const n = cp.spawn('node', [exec], {
      stdio: ['ignore','pipe','pipe','ipc']
   });

});

such that stdout and stderr are piped to the parent process? I assume the performance penalty could be drastic, especially if we handle stdout/stderr in the parent process like so:

    // (100+ items to iterate over)

    items.forEach(function (exec) {

      const n = cp.spawn('node', [exec], {
        stdio: ['ignore', 'pipe', 'pipe', 'ipc']
      });

      n.stdout.setEncoding('utf8');
      n.stderr.setEncoding('utf8');

      n.stdout.on('data', function (d) {
        // do something with the data
      });

      n.stderr.on('data', function (d) {
        // do something with the data
      });

    });

My assumptions are:

  1. Using 'ignore' for stdout and stderr in the parent process is more performant than piping stdout/stderr to the parent process.
  2. If we stream stdout/stderr to a file, like so:

    stdio: ['ignore', fs.openSync('/some/file.log', 'a'), fs.openSync('/some/file.log', 'a'), 'ipc']

then this is almost as performant as using 'ignore' for stdout/stderr (which sends stdout/stderr to /dev/null).

Are these assumptions correct or not? With regard to stdout/stderr, how can I achieve highest performance, if I want to log the stdout/stderr somewhere (not to /dev/null)?

Note: This is for a library so the amount of stdout/stderr could vary quite a bit. Also, most likely will rarely fork more processes than there are cores, at most running about 15 processes simultaneously.

Upvotes: 6

Views: 4529

Answers (2)

jcaron

Reputation: 17710

You have the following options:

  • you can have the child process completely ignore stdout/stderr, and do its own logging by any other means (log to a file, syslog...)

  • if you're logging the output of your parent process, you can set stdout/stderr to process.stdout and process.stderr respectively. The child's output will then go to the same destination as the parent's, and nothing will flow through the parent process.

  • you can set file descriptors directly. This means the output of the child process will go to the given files, without going through the parent process

  • however, if you don't have any control over the child processes AND you need to somehow do something to the logs (filter them, prefix them with the associated child process, etc.), then you probably need to go through the parent process.

As we have no idea of the volume of logs you're talking about, we have no idea whether this is critical or just premature optimisation. Node.js being asynchronous, I don't expect your parent process to become a bottleneck unless it's really busy and you have lots of logs.

Upvotes: 4

John Zwinck

Reputation: 249133

Are these assumptions correct or not?

how can I achieve highest performance?

Test it. That's how you can achieve the highest performance. Test on the same type of system you will use in production, with the same number of CPUs and similar disks (SSD or HDD).

I assume your concern is that the children might become blocked if the parent does not read quickly enough. That is a potential problem, depending on the buffer size of the pipe and how much data flows through it. However, if the alternative is to have each child process write to disk independently, this could be better, the same, or worse. We don't know for a whole bunch of reasons, starting with the fact that we have no idea how many cores you have, how quickly your processes produce data, and what I/O subsystem you're writing to.

If you have a single SSD you might be able to write 500 MB per second. That's great, but if that SSD is 512 GB in size, you'll only last 16 minutes before it is full! You'll need to narrow down the problem space a lot more before anyone can know what's the most efficient approach.

If your goal is simply to get logged data off the machine with as little system utilization as possible, your best bet is to directly write your log messages to the network.

Upvotes: 1
