Reputation: 79248
Say you have a huge (> 1GB) CSV of record ids:
655453
4930285
493029
4930301
493031
...
And for each id you want to make a REST API call to fetch the record data, transform it locally, and insert it into a local database. How do you do that with Node.js' Readable Stream?
My question is basically this: How do you read a very large file, line-by-line, run an async function for each line, and [optionally] be able to start reading the file from a specific line?
From the following Quora question I'm starting to learn to use fs.createReadStream:
http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
var fs = require('fs');
var lazy = require('lazy');
var stream = fs.createReadStream(path, {
flags: 'r',
encoding: 'utf-8'
});
new lazy(stream).lines.forEach(function(line) {
var id = line.toString();
// pause stream
stream.pause();
// make async API call...
makeAPICall(id, function() {
// then resume to process next id
stream.resume();
});
});
But that pseudocode doesn't work, because the lazy module forces you to read the whole file (as a stream, but there's no pausing). So that approach doesn't seem like it will work.
Another thing is, I would like to be able to start processing this file from a specific line. The reason is that processing each id (making the API call, cleaning the data, etc.) can take up to half a second per record, so I don't want to have to start from the beginning of the file each time. The naive approach I'm thinking about is to capture the line number of the last id processed and save it. Then, when you parse the file again, you stream through the ids line by line until you reach the line number you left off at, and then you do the makeAPICall business. Another naive approach is to write small files (say of 100 ids each) and process them one at a time (a small enough dataset to do everything in memory without an IO stream). Is there a better way to do this?
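For reference, here is roughly what I mean by the line-counting approach. This is only a sketch, assuming Node's readline module (which emits a 'line' event per line of a readable stream) plus hypothetical loadLastLine()/saveLastLine() helpers and an ids.csv file name:

var fs = require('fs');
var readline = require('readline');

var startLine = loadLastLine();   // hypothetical helper: read saved progress
var currentLine = 0;

var rl = readline.createInterface({
  input: fs.createReadStream('ids.csv', { encoding: 'utf-8' })
});

rl.on('line', function (id) {
  currentLine++;
  if (currentLine <= startLine) return;   // skip ids that were already processed

  rl.pause();                             // note: a few already-buffered lines may still fire
  makeAPICall(id, function () {
    saveLastLine(currentLine);            // hypothetical helper: persist progress
    rl.resume();
  });
});

(I know fs.createReadStream also accepts a start byte offset, so saving a byte offset instead of a line number would avoid re-reading the skipped lines entirely, but line numbers are easier to reason about.)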
I can see how this gets tricky (and where node-lazy comes in) because the chunk in stream.on('data', function(chunk) {}); may contain only part of a line (if the bufferSize is small, each chunk may be 10 lines, but because the id is variable length, it may only be 9.5 lines or whatever). This is why I'm wondering what the best approach is to the above question.
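For illustration, this is roughly the buffering that a line-splitting module has to do for you. A sketch, assuming a utf-8-encoded stream over a hypothetical ids.csv:

var fs = require('fs');

var stream = fs.createReadStream('ids.csv', { encoding: 'utf-8' });
var leftover = '';

stream.on('data', function (chunk) {
  var lines = (leftover + chunk).split('\n');
  leftover = lines.pop();                 // the last piece may be a partial line
  lines.forEach(function (line) {
    if (line) console.log('complete id:', line);
  });
});

stream.on('end', function () {
  if (leftover) console.log('complete id:', leftover);   // final line with no trailing newline
});

That handles the "9.5 lines" problem, but it still doesn't address pausing mid-chunk while an async call runs, which is why a stream that emits one line at a time seems necessary.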
Upvotes: 6
Views: 3473
Reputation: 960
Related to Andrew Андрей Листочкин's answer:
You can use a module like byline to get a separate data event for each line. It's a transform stream wrapped around the original file stream, so each data event it produces is a single line. This lets you pause after each line. byline won't read the entire file into memory like lazy apparently does.
var fs = require('fs');
var byline = require('byline');
var stream = fs.createReadStream('bigFile.txt');
stream.setEncoding('utf8');
// Comment out this line to see what the transform stream changes.
stream = byline.createStream(stream);
// Write each line to the console with a delay.
stream.on('data', function(line) {
// Pause until we're done processing this line.
stream.pause();
setTimeout(() => {
console.log(line);
// Resume processing.
stream.resume();
}, 200);
});
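The same pattern drops straight into the original use case; assuming the makeAPICall(id, callback) from the question, the handler would look something like:

stream.on('data', function (line) {
  stream.pause();                        // stop until this id has been processed
  makeAPICall(line.trim(), function () {
    stream.resume();                     // then move on to the next id
  });
});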
Upvotes: 2
Reputation: 8542
I guess you don't need to use node-lazy. Here's what I found in the Node docs:

Event: 'data'
function (data) { }
The 'data' event emits either a Buffer (by default) or a string if setEncoding() was used.

So that means that if you call setEncoding() on your stream, your data event callback will receive a string parameter. Inside this callback you can then use the .pause() and .resume() methods.
The pseudo code should look like this:
stream.setEncoding('utf8');
stream.addListener('data', function (line) {
// pause stream
stream.pause();
// make async API call...
makeAPICall(line, function() {
// then resume to process next line
stream.resume();
});
})
Although the docs don't explicitly say that the stream is read line by line, I assume that's the case for file streams. At least in other languages and platforms, text streams work that way, and I see no reason for Node streams to differ.
Upvotes: 1