Reputation: 197
Suppose I have a directory that contains 100K+ or even 500K+ files. I want to read the directory with fs.readdir, but it is async, not a stream. Someone told me that the async call holds the entire file list in memory before the read is done.
So what is the solution? I want to read the directory with a stream approach. Can I?
Upvotes: 15
Views: 13187
Reputation: 2327
The answer by @mstephen19 gave the right direction, but it uses an async generator inside Readable.read(), which does not support that. If you try to turn opendirGen() into a recursive function to recurse into subdirectories, it stops working. Using Readable.from() is the solution here. The following is his solution adapted accordingly (with opendirGen() still not recursive):
import { opendir } from 'node:fs/promises';
import { Readable } from 'node:stream';
async function* opendirGen(dir) {
  // Yield each entry name from the given directory.
  for await (const file of await opendir(dir)) {
    yield file.name;
  }
}
Readable
  .from(opendirGen('/tmp'), { encoding: 'utf8' })
  .on('data', name => console.log(name));
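For completeness, here is a sketch of how a recursive variant could look with Readable.from(); the generator name opendirRecursiveGen and the choice to yield full paths are my own illustration, not part of the original answer:
import { opendir } from 'node:fs/promises';
import { join } from 'node:path';
import { Readable } from 'node:stream';

// Illustrative recursive variant (not from the original answer):
// recursively yield full paths of files in dir and its subdirectories.
async function* opendirRecursiveGen(dir) {
  for await (const entry of await opendir(dir)) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      yield* opendirRecursiveGen(fullPath);
    } else {
      yield fullPath;
    }
  }
}

Readable
  .from(opendirRecursiveGen('/tmp'), { encoding: 'utf8' })
  .on('data', name => console.log(name));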
Upvotes: 0
Reputation: 1926
Here are two viable solutions:
1. Use the fs.opendir function to create a Dir object, which has a Symbol.asyncIterator property.
import { opendir } from 'fs/promises';
// An async generator that accepts a directory name
const openDirGen = async function* (directory: string) {
  // Create a Dir object for that directory
  const dir = await opendir(directory);
  // Iterate through the items in the directory asynchronously
  for await (const file of dir) {
    // (yield whatever you want here)
    yield file.name;
  }
};
The usage of this is as follows:
for await (const name of openDirGen('./src')) {
  console.log(name);
}
2. A Readable stream can be created using the async generator we created above.
// ...
import { Readable } from 'stream';
// ...
// A function accepting the directory name
const openDirStream = (directory: string) => {
  return new Readable({
    // Set encoding to utf-8 to get the names of the items in
    // the directory as utf-8 strings.
    encoding: 'utf-8',
    // Create a custom read method which is async, but works
    // because it doesn't need to be awaited, as Readable is
    // event-based anyways.
    async read() {
      // Asynchronously iterate through the item names in
      // the directory using the openDirGen generator.
      for await (const name of openDirGen(directory)) {
        // Push each name into the stream, emitting the
        // 'data' event each time.
        this.push(name);
      }
      // Once iteration is complete, manually destroy the stream.
      this.destroy();
    },
  });
};
You can use this the same way you'd use any other Readable
stream:
const myDir = openDirStream('./src');
myDir.on('data', (name) => {
  // Logs the file name of each file in my './src' directory
  console.log(name);
  // You can do anything you want here, including actually reading
  // the file.
});
Both of these solutions will allow you to asynchronously iterate through the item names within a directory rather than pull them all into memory at once like fs.readdir
does.
Upvotes: 0
Reputation: 4522
The more modern answer for this is to use opendir (added in v12.12.0) to iterate over each file as it is found:
import { opendirSync } from "fs";
const dir = opendirSync("./files");
for await (const entry of dir) {
  console.log("Found file:", entry.name);
}
fsPromises.opendir / opendirSync return an instance of Dir, which is an async iterable that yields a Dirent (directory entry) for every file in the directory.
This is more efficient because it returns each file as it is found, rather than having to wait till all files are collected.
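For reference, a minimal sketch of the promise-based form mentioned above, assuming the same "./files" directory:
import { opendir } from "fs/promises";

// opendir() resolves to a Dir, an async iterable of Dirent objects.
const dir = await opendir("./files");
for await (const entry of dir) {
  console.log("Found file:", entry.name);
}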
Upvotes: 1
Reputation: 2895
As of version 10, there is still no good solution for this. Node is just not that mature yet.
Modern filesystems can easily handle millions of files in a directory, and of course you can make a good case for doing so in large-scale operations, as you suggest.
The underlying C library iterates over the directory listing one entry at a time, as it should. But every Node implementation I have seen that claims to iterate uses fs.readdir, which reads everything into memory as fast as it can.
As I understand it, you have to wait for a new version of libuv to be adopted into Node, and then for the maintainers to address this old issue. See the discussion at https://github.com/nodejs/node/issues/583
Some improvements are expected with version 12.
Upvotes: -1
Reputation: 1771
Now there is a way to do it with async iteration! You can do:
import fs from 'fs'

const dir = fs.opendirSync('/tmp')
for await (const file of dir) {
  console.log(file.name)
}
To turn it into a stream:
import { pipeline, Readable } from 'stream'
import util from 'util'

const _pipeline = util.promisify(pipeline)

await _pipeline([
  Readable.from(dir),
  ... // consume!
])
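For illustration, here is a fuller sketch of that pipeline; the object-mode Writable (logNames) is a hypothetical consumer of my own, not part of the original answer:
import { opendirSync } from 'fs'
import { pipeline, Readable, Writable } from 'stream'
import util from 'util'

const _pipeline = util.promisify(pipeline)

const dir = opendirSync('/tmp')

// Hypothetical consumer: an object-mode Writable that logs each entry's name.
const logNames = new Writable({
  objectMode: true,
  write (entry, _encoding, callback) {
    console.log(entry.name)
    callback()
  }
})

await _pipeline([
  Readable.from(dir),
  logNames
])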
Upvotes: 8
Reputation: 9918
On modern computers, traversing a directory with 500K files is nothing. When you call fs.readdir asynchronously in Node.js, all it does is read a list of file names in the specified directory; it doesn't read the files' contents. I've just tested with 700K files in the dir: it takes only 21MB of memory to load this list of file names.
Once you've loaded this list of file names, you just traverse them one by one, or in parallel with some concurrency limit, and you can easily consume them all. Example:
var async = require('async'),
    fs = require('fs'),
    path = require('path'),
    parentDir = '/home/user';

async.waterfall([
  function (cb) {
    fs.readdir(parentDir, cb);
  },
  function (files, cb) {
    // `files` is just an array of file names, not full paths.
    // Consume 10 files in parallel.
    async.eachLimit(files, 10, function (filename, done) {
      var filePath = path.join(parentDir, filename);
      // Do with this file whatever you want.
      // Then don't forget to call `done()`.
      done();
    }, cb);
  }
], function (err) {
  err && console.trace(err);
  console.log('Done');
});
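For comparison, a sketch of the same pattern without the async library, using promises and a small worker pool; the concurrency value and the fs.stat() placeholder work are assumptions of my own:
const fs = require('fs').promises;
const path = require('path');

const parentDir = '/home/user';
const concurrency = 10; // how many files to handle in parallel (example value)

async function main() {
  // `files` is just an array of file names, not full paths.
  const files = await fs.readdir(parentDir);

  // A fixed pool of workers pulls the next name off a shared index.
  let index = 0;
  const worker = async () => {
    while (index < files.length) {
      const filePath = path.join(parentDir, files[index++]);
      // Placeholder work: stat the file. Do whatever you want here instead.
      await fs.stat(filePath);
    }
  };

  await Promise.all(Array.from({ length: concurrency }, worker));
  console.log('Done');
}

main().catch(console.trace);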
Upvotes: 8