kevzettler

Reputation: 5213

Best way to read many files with Node.js?

I have a large list of file paths. I'm getting this list from a streaming glob module: https://github.com/wearefractal/glob-stream

I was piping this stream into another stream that created a fileReadStream for each path, and I quickly hit some limits. I was getting the warning:

warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit

and also Error: EMFILE, open

I've tried bumping maxListeners, but I have ~9000 files that would each be creating a stream, and I'm concerned that will eat memory. That number isn't constant and will grow. Am I safe to remove the limit here?

Should I be doing this synchronously? Or should I iterate over the paths and read the files sequentially? Wouldn't a for loop still kick off all the reads at once?

Upvotes: 2

Views: 800

Answers (1)

josh3736

Reputation: 144912

The max listeners thing is purely a warning: setMaxListeners only controls the threshold at which that message is printed to the console, nothing else. You can disable the check or just ignore the message.
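If you do want to silence it, here's a minimal sketch, where emitter stands in for whatever stream or emitter is accumulating the listeners:

var EventEmitter = require('events').EventEmitter;
var emitter = new EventEmitter();

// A value of 0 removes the cap entirely; any positive number just
// moves the threshold at which Node prints the warning.
emitter.setMaxListeners(0);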

EMFILE means your OS is enforcing a limit on the number of open files (file descriptors) your process can have at a single time. You could avoid the error by raising the limit with ulimit -n.
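For context, the failure mode looks like this hypothetical loop (paths standing in for the ~9000-entry list from the question): every iteration opens a file descriptor immediately, and each stream holds its descriptor until it ends.

var fs = require('fs');

// Anti-pattern: kicks off ~9000 opens nearly simultaneously, quickly
// exceeding the per-process file descriptor limit (EMFILE).
paths.forEach(function(path) {
    var stream = fs.createReadStream(path);
    // ...
});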

Saturating the disk by running many thousands of concurrent filesystem operations won't get you any added performance; in fact, it will hurt, especially on traditional non-SSD drives. So it is a good idea to run only a controlled number of operations at once.

I'd probably use an async queue, such as async.queue from the async module: you push the name of every file onto the queue in one loop, and the queue runs only n operations at a time. When an operation finishes, the next one in the queue starts.

For example:

var fs = require('fs');
var async = require('async');

// The worker reads one file and calls cb() when the stream ends so
// the queue can start the next job; concurrency is set to 2 here.
var q = async.queue(function (file, cb) {
    var stream = fs.createReadStream(file.path);
    // ...
    stream.on('end', function() {
        // finish up, then
        cb();
    });
}, 2);

// globStream comes from glob-stream (see the question); each 'data'
// event carries a file object with a path property.
globStream.on('data', function(file) {
    q.push(file);
});

globStream.on('end', function() {
    // We don't want to add the `drain` handler until *after* the globstream
    // finishes.  Otherwise, we could end up in a situation where the globber
    // is still running but all pending file read operations have finished.
    q.drain = function() {
        // All done with everything.
    };

    // ...and if the queue is empty when the globber finishes, make sure the done
    // callback gets called.
    if (q.idle()) q.drain();
});

You may have to experiment a little to find the right concurrency number for your application.
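As a convenience while experimenting, the queue's concurrency property can be changed after the queue is created, so you can tune it on the fly:

// Jobs already running are unaffected; the new value applies as
// workers free up.
q.concurrency = 4;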

Upvotes: 2
