Symphony0084

Reputation: 1435

How to read a CSV stream from S3, but starting from somewhere in the middle of the file?

As the title states, my question pertains mostly to reading CSV data from AWS S3. I will provide details about the other technologies I am using, but they are not important to the core problem.

Context (not the core issue, just some extra detail)

I have a use case where I need to process some very large CSVs using a Node.js API on AWS Lambda and store some data from each CSV row to DynamoDB.

My implementation works well for small-to-medium-sized CSV files. However, for large CSV files (think 100k - 1m rows), the process takes way more than 15 minutes (the maximum execution time for an AWS Lambda function).

I really need this implementation to be serverless (the rest of the project is serverless, usage patterns are unpredictable, etc.).

So I decided to try and process the beginning of the file for 14.5 minutes or so, and then queue a new Lambda function to pick up where the last one left off.

I can easily pass the row number from the last function to the new function, so the new Lambda function knows where to start from.

So if the 1st function processed lines 1 - 15,000, then the 2nd function would pick up the processing job at row 15,001 and continue from there. That part is easy.
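(For reference, the hand-off itself is roughly the sketch below. It assumes the AWS SDK v2; the function name and the payload shape are just placeholders.)

// Sketch: when ~14.5 minutes are up, re-invoke the processing function
// asynchronously with the next row number in the payload.
const aws = require('aws-sdk');

const lambda = new aws.Lambda();

async function queueNextChunk(nextRow) {
    await lambda
        .invoke({
            FunctionName: 'csv-processor', // placeholder name
            InvocationType: 'Event', // fire-and-forget, don't wait for the result
            Payload: JSON.stringify({ startRow: nextRow }),
        })
        .promise();
}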

But I can't figure out how to start a read stream from S3 beginning in the middle. No matter how I set up my read stream, it always starts data flow from the beginning of the file.

Breaking the processing task into smaller pieces (like queueing a new Lambda for each row) is not an option; I have already gone down that path and optimized the per-row processing to be as minimal as possible.

Even if the 2nd job starts reading at the beginning of the file and I set it up to skip the already-processed rows, it will still take too long to get to the end of the file.

And even if I switch to some other implementation (like EC2 instead of Lambda), I still run into the same problem. What if the EC2 process fails at row 203,001? I would need to queue up a new job to pick up from the next row. No matter what technology, container, or environment I use, I still need to be able to start reading from the middle of a file.

Core Problem

So... let's say I have a CSV file saved to S3. And I know that I want to start reading from row 15,001. Or alternatively, I want to start reading from the 689,475th byte. Or whatever.

Is there a way to do that? Using the AWS SDK for Node.js or any other type of request?

I know how to set up a read stream from S3 in Node.js, but I don't know how it works under the hood in terms of how the requests are actually made. Maybe that knowledge would be helpful.
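For context, a plain read stream set up roughly like the sketch below (assuming the AWS SDK v2 and Node's built-in readline to split rows) issues a single HTTP GET for the whole object, so data always flows from byte 0:

// A plain S3 read stream with no Range: under the hood this is one
// HTTP GET for the entire object, so it always starts at byte 0.
const aws = require('aws-sdk');
const readline = require('readline');

const stream = new aws.S3()
    .getObject({ Key: 'bigA$$File.csv', Bucket: 'bucket-o-mine' })
    .createReadStream();

const rl = readline.createInterface({ input: stream });

rl.on('line', (row) => {
    // ...process one CSV row and write to DynamoDB...
});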

Upvotes: 2

Views: 1769

Answers (1)

Symphony0084

Reputation: 1435

Ah, it was so much easier than I was making it... Here is the answer in Node.js (AWS SDK v2):

// Ask S3 for only the byte range you need; createReadStream() then
// streams just that slice of the object instead of the whole file.
const aws = require('aws-sdk');

const stream = new aws.S3()
    .getObject({
        Key: 'bigA$$File.csv',
        Bucket: 'bucket-o-mine',
        Range: 'bytes=65000-100000',
    })
    .createReadStream();

Here is the doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html

You can do this in any of the AWS SDKs, or directly via the HTTP Range header.
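For example, a rough equivalent with the v3 JavaScript SDK (assuming @aws-sdk/client-s3; the bucket and key are the same made-up names as above) looks like this:

// Same byte-range read with the AWS SDK for JavaScript v3.
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

async function readRange() {
    const { Body } = await s3.send(
        new GetObjectCommand({
            Bucket: 'bucket-o-mine',
            Key: 'bigA$$File.csv',
            Range: 'bytes=65000-100000',
        })
    );
    return Body; // in Node.js this is a Readable stream you can pipe into a CSV parser
}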

Here's what AWS has to say about the Range header:

Downloads the specified range bytes of an object. For more information about the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
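One caveat to keep in mind: a byte range will almost never land exactly on a row boundary, so the resumed stream usually starts in the middle of a row. Here is a sketch of one way to handle that (assuming each run finishes the row it is in the middle of before handing off, and using the same made-up bucket and key):

// Sketch: resume at an arbitrary byte offset and drop the partial
// first line, since the previous run already finished that row.
const aws = require('aws-sdk');
const readline = require('readline');

function processFrom(offset) {
    const stream = new aws.S3()
        .getObject({
            Key: 'bigA$$File.csv',
            Bucket: 'bucket-o-mine',
            Range: `bytes=${offset}-`, // open-ended range: offset through end of file
        })
        .createReadStream();

    const rl = readline.createInterface({ input: stream });
    let first = true;

    rl.on('line', (row) => {
        if (first) {
            first = false;
            if (offset > 0) return; // partial row, already handled by the previous run
        }
        // ...process one CSV row...
    });
}

You would also need to track roughly how many bytes each run has consumed, so the hand-off payload can carry a byte offset instead of (or alongside) a row number.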

Upvotes: 1
