Reputation: 2701
Using node.js, with the intention of running this module as an AWS Lambda function.
Using s3.getObject() from aws-sdk, I am able to successfully pick up a very large CSV file from Amazon S3. The intention is to read each line in the file and emit an event with the body of each line.
In all examples I could find, it looks like the entire CSV file in S3 has to be buffered or streamed, converted to a string and then read line by line.
s3.getObject(params, function(err, data) {
  var body = data.Body.toString('utf-8');
});
This operation takes a very long time, given the size of the source CSV file. Also, the CSV rows are of varying length, and I'm not certain if I can use the buffer size as an option.
Question
Is there a way to pick up the S3 file in node.js and read/transform it line by line, which avoids stringifying the entire file in memory first?
Ideally, I'd prefer to use the better capabilities of fast-csv and/or node-csv, instead of looping manually.
Upvotes: 15
Views: 40669
Reputation: 431
I do not have enough reputation to comment, but as of now the accepted answer's fromStream method is deprecated in fast-csv. You now need to use the parseStream method instead:
const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').parseStream(s3Stream)
  .on('data', (data) => {
    // use rows
  })
Upvotes: 17
Reputation: 677
For me, the answer that solved my issue was:
const csv = require('@fast-csv/parse');

const params = {
  Bucket: srcBucket,
  Key: srcKey,
};

const csvFile = s3.getObject(params).createReadStream();

// Wrap the parser in a Promise so it can be awaited (e.g. inside an async Lambda handler)
let parserFcn = new Promise((resolve, reject) => {
  const parser = csv
    .parseStream(csvFile, { headers: true })
    .on("data", function (data) {
      console.log('Data parsed: ', data);
    })
    .on("end", function () {
      resolve("csv parse process finished");
    })
    .on("error", function () {
      reject("csv parse process failed");
    });
});

try {
  await parserFcn;
} catch (error) {
  console.log("Get Error: ", error);
}
Upvotes: 6
Reputation: 17168
You should just be able to use the createReadStream method and pipe it into fast-csv:
const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').fromStream(s3Stream)
  .on('data', (data) => {
    // do something here
  })
Upvotes: 26
Reputation: 3404
Not line by line, but you can get S3 objects by byte range using the Range header. So you could read, say, 1000 bytes at a time and handle the newlines on your end as you read the data. Look at the GET Object documentation and search for the Range header.
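A minimal sketch of what that could look like with the AWS SDK v2 Range parameter; the chunk size, bucket/key names, and line handling are illustrative only, and multi-byte UTF-8 characters split across a chunk boundary would need extra care:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function readLinesByRange(bucket, key, chunkSize = 1000) {
  let start = 0;
  let leftover = '';
  let totalSize = Infinity;

  while (start < totalSize) {
    const end = start + chunkSize - 1;
    // Fetch only the requested byte range of the object
    const data = await s3.getObject({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`,
    }).promise();

    // ContentRange looks like "bytes 0-999/12345"; the part after "/" is the object size
    totalSize = parseInt(data.ContentRange.split('/')[1], 10);

    const text = leftover + data.Body.toString('utf-8');
    const lines = text.split('\n');
    leftover = lines.pop(); // keep the partial last line for the next chunk

    for (const line of lines) {
      console.log(line); // emit/process each complete line here
    }

    start = end + 1;
  }

  if (leftover) {
    console.log(leftover); // final line if the file has no trailing newline
  }
}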
Upvotes: 0