Reputation: 2701
Using node.js, with the intention of running this module as an AWS Lambda function.
Using s3.getObject() from aws-sdk, I am able to successfully pick up a very large CSV file from Amazon S3. The intention is to read each line in the file and emit an event with the body of each line.
In all examples I could find, it looks like the entire CSV file in S3 has to be buffered or streamed, converted to a string and then read line by line.
s3.getObject(params, function(err, data) {
  var body = data.Body.toString('utf-8');
});
This operation takes a very long time, given the size of the source CSV file. Also, the CSV rows are of varying length, and I'm not certain if I can use the buffer size as an option.
Question
Is there a way to pick up the S3 file in node.js and read/transform it line by line, which avoids stringifying the entire file in memory first?
Ideally, I'd prefer to use the better capabilities of fast-csv and/or node-csv, instead of looping manually.
Upvotes: 15
Views: 40669
Reputation: 431
I do not have enough reputation to comment, but as of now the accepted answer's fromStream method is deprecated in fast-csv. You now need to use the parseStream method instead:
const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').parseStream(s3Stream)
  .on('data', (data) => {
    // use rows
  })
Upvotes: 17
Reputation: 677
For me, the answer that solved my issue was:
const csv = require('@fast-csv/parse');

const params = {
  Bucket: srcBucket,
  Key: srcKey,
};

const csvFile = s3.getObject(params).createReadStream();

// Wrap the parser in a Promise so it can be awaited (e.g. inside an async Lambda handler)
let parserFcn = new Promise((resolve, reject) => {
  const parser = csv
    .parseStream(csvFile, { headers: true })
    .on("data", function (data) {
      console.log('Data parsed: ', data);
    })
    .on("end", function () {
      resolve("csv parse process finished");
    })
    .on("error", function () {
      reject("csv parse process failed");
    });
});

try {
  await parserFcn;
} catch (error) {
  console.log("Get Error: ", error);
}
Upvotes: 6
Reputation: 17168
You should just be able to use the createReadStream method and pipe it into fast-csv:
const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').fromStream(s3Stream)
  .on('data', (data) => {
    // do something here
  })
Upvotes: 26
Reputation: 3404
Not line by line, but you can get S3 objects by byte range using the Range header. So you could read, say, 1000 bytes at a time and handle the newlines on your end as you read the data. Look at the GET Object documentation and search for the Range header.
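A minimal sketch of what that could look like with the AWS SDK v2 Range parameter; the chunk size, bucket/key names, and line handling are illustrative only, and multi-byte UTF-8 characters split across a chunk boundary would need extra care:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function readLinesByRange(bucket, key, chunkSize = 1000) {
  let start = 0;
  let leftover = '';
  let totalSize = Infinity;

  while (start < totalSize) {
    const end = start + chunkSize - 1;
    // Fetch only the requested byte range of the object
    const data = await s3.getObject({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`,
    }).promise();

    // ContentRange looks like "bytes 0-999/12345"; the part after "/" is the object size
    totalSize = parseInt(data.ContentRange.split('/')[1], 10);

    const text = leftover + data.Body.toString('utf-8');
    const lines = text.split('\n');
    leftover = lines.pop(); // keep the partial last line for the next chunk

    for (const line of lines) {
      console.log(line); // emit/process each complete line here
    }

    start = end + 1;
  }

  if (leftover) {
    console.log(leftover); // final line if the file has no trailing newline
  }
}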
Upvotes: 0