HowlingEverett

Reputation: 810

How to process a large (1000+ files) directory of files with Node.js?

So, I have a fairly large directory of files that I need to process continually from a long-running process in a NodeJS application. The directory is continually being processed and emptied, but it's not uncommon for 1000+ files to be in line for processing at any given time. They're gzipped CSV files, so my naive solution was to grab the directory listing, iterate over the files, open and parse each one, then continue, like this:

fs   = require 'fs'
path = require 'path'
zlib = require 'zlib'

files = fs.readdirSync 'directory'

for filename in files
  file_path = path.resolve path.join 'directory', filename
  fd = fs.openSync file_path, 'r'
  buf = new Buffer fs.statSync(file_path).size
  fs.readSync fd, buf, 0, buf.length, 0
  fs.closeSync fd
  zlib.gunzip buf, (err, buf) =>
    throw err if err
    content = buf.toString().split("\n")
    for line in content
      # parse, process content, archive file

I'm quickly running up against an EMFILE (too many open files) error. Please excuse the Sync versions of the fs functions and the CoffeeScript.

Is there a better way of processing a massive number of files in a managed way? Ultimately I'd like to use something like a single parsing Stream - I know how to do that with a single large (or even growing) file, but not with a directory full of separate files.

The files are being generated by a large number of disparate clients and uploaded to a public-facing web server, which then synchronises them regularly to my input directory over a secure protocol. It's not an ideal setup, but it's necessary given the specific nature of the system, and it explains why I can't simply alter the files to be, say, a single multiplexed stream.

Upvotes: 2

Views: 2946

Answers (2)

David Manheim

Reputation: 2626

Mixu's Node book has a section on how to manage exactly this type of problem. http://book.mixu.net/node/ch7.html

You can run code in "limited parallel" using the following code, as shown there; the limit parameter makes it easy to control how many items you process at once:

// Simulated async task: "processes" arg and calls back one second later
function async(arg, callback) {
  console.log('do something with \''+arg+'\', return 1 sec later');
  setTimeout(function() { callback(arg * 2); }, 1000);
}
// Called once every item has been processed
function final() { console.log('Done', results); }

var items = [ 1, 2, 3, 4, 5, 6 ];
var results = [];
var running = 0;   // number of tasks currently in flight
var limit = 2;     // maximum number of concurrent tasks

// Starts tasks until the concurrency limit is reached; each completed task
// either launches more work or, when everything is finished, calls final()
function launcher() {
  while(running < limit && items.length > 0) {
    var item = items.shift();
    async(item, function(result) {
      results.push(result);
      running--;
      if(items.length > 0) {
        launcher();
      } else if(running == 0) {
        final();
      }
    });
    running++;
  }
}

launcher();
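
As a rough sketch only (not from Mixu's book): applied to the directory in the question, the same pattern could limit how many gzipped files are read and inflated at once. The directory name, the limit of 10, and the line-processing placeholder are assumptions here:

var fs = require('fs');
var path = require('path');
var zlib = require('zlib');

var dir = 'directory';          // placeholder input directory
var queue = fs.readdirSync(dir);
var running = 0;
var limit = 10;                 // at most 10 files in flight at once

function processFile(filename, done) {
  var filePath = path.join(dir, filename);
  fs.readFile(filePath, function(err, compressed) {
    if (err) return done(err);
    zlib.gunzip(compressed, function(err, buf) {
      if (err) return done(err);
      buf.toString().split('\n').forEach(function(line) {
        // parse, process content, archive file
      });
      done();
    });
  });
}

function launcher() {
  while (running < limit && queue.length > 0) {
    running++;
    processFile(queue.shift(), function(err) {
      if (err) console.error(err);
      running--;
      if (queue.length > 0) {
        launcher();
      } else if (running === 0) {
        console.log('all files processed');
      }
    });
  }
}

launcher();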

Upvotes: 0

Floby

Reputation: 2336

Not exactly a parsing stream, but it could be a step towards one:

You could use https://npmjs.org/package/generic-pool to limit the number of files being processed concurrently. You just have to define which resource is to be pooled.

In your case, I assume the resource to pool should be a file processor, so that only one (or a few) can be alive at once.

You can also have some kind of iterator method to decide which file should be processed next.
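
For example, a minimal sketch of what that could look like, assuming generic-pool's callback-based Pool() API (the 2.x interface); the "slot" token objects and the processFile helper are placeholders for illustration, not the actual code from the gist below:

var poolModule = require('generic-pool');
var fs = require('fs');
var zlib = require('zlib');

// Pool of "slots": each acquired slot allows one file to be processed at a time.
var pool = poolModule.Pool({
  name: 'file-processor',
  create: function(callback) { callback(null, {}); }, // the resource is just a token
  destroy: function(slot) {},
  max: 10 // at most 10 files in flight at once
});

function processFile(filePath) {
  pool.acquire(function(err, slot) {
    if (err) throw err;
    fs.readFile(filePath, function(err, compressed) {
      if (err) { pool.release(slot); return console.error(err); }
      zlib.gunzip(compressed, function(err, buf) {
        if (!err) {
          buf.toString().split('\n').forEach(function(line) {
            // parse / process each line, then archive the file
          });
        }
        pool.release(slot); // free the slot so the next file can start
      });
    });
  });
}

Every file in the directory listing can then be handed to processFile immediately; the pool ensures that no more than max of them are actually open at any moment.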

EDIT: Completing my answer. I had a go at your problem and tried this https://gist.github.com/Floby/5064222

Upvotes: 1
