Andremoniy

Reputation: 34900

Stream and zip to S3 from AWS Lambda Node.JS

My goal is to create a large gzipped text file and put it into S3.

The file contents consist of blocks which I read in a loop from another source.

Because of the size of this file I cannot hold all the data in memory, so I need to somehow stream it directly to S3 and gzip it at the same time.

I understand how to perform this trick with the regular fs in Node.JS, but I am confused about whether it is possible to do the same with S3 from AWS Lambda. I know that s3.putObject can consume a stream as its body, but it seems to me that the stream has to be finalized before I perform the putObject operation, which could exceed the allowed memory.
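For illustration, the kind of pattern I have in mind looks roughly like this (just a sketch; the function and parameter names are placeholders):

```javascript
// Sketch of the naive approach: gzip everything into memory first, then putObject.
// This only works while the finished gzip output still fits in the Lambda's memory.
const AWS = require('aws-sdk');
const zlib = require('zlib');

const s3 = new AWS.S3();

// bucket, key and sourceStream are illustrative names
function putGzipped(bucket, key, sourceStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    sourceStream
      .pipe(zlib.createGzip())
      .on('data', (chunk) => chunks.push(chunk))   // buffers the whole compressed file
      .on('error', reject)
      .on('end', () => {
        s3.putObject({
          Bucket: bucket,
          Key: key,
          Body: Buffer.concat(chunks),             // body is finalized, length is known
        }).promise().then(resolve, reject);
      });
  });
}
```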

Upvotes: 5

Views: 3408

Answers (1)

Unglückspilz

Reputation: 1890

You can stream files into S3 buckets in chunks (each at least 5 MB, except the last) using the multipart upload functions in the Node.js aws-sdk.

This is not only useful for streaming large files into buckets, but also lets you retry failed chunks (instead of a whole file) and parallelize the upload of individual chunks (with multiple upload lambdas, which could be useful in a serverless ETL setup, for example). The order in which the chunks arrive is not important as long as you track them and finalize the process once all of them have been uploaded.

To use the multipart upload, you should:

  1. initialize the process using createMultipartUpload and store the returned UploadId (you'll need it for chunk uploads)
  2. implement a Transform stream that would process data coming from the input stream
  3. implement a PassThrough stream which would buffer the data in large enough chunks before using uploadPart to push them to S3 (under the UploadId returned in step 1)
  4. track the returned ETags and PartNumbers from chunk uploads
  5. use the tracked ETags and PartNumbers to assemble/finalize the file on S3 using completeMultipartUpload

Here's the gist of it in a working code example which streams a file from iso.org, pipes it through gzip and into an S3 bucket. Don't forget to change the bucket name, and make sure to run the Lambda with 512 MB of memory on Node 6.10. You can use the code directly in the web GUI since there are no external dependencies.
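As a rough orientation, a trimmed-down sketch of that flow could look like the following. It assumes a current Node.js runtime with async stream iteration and the aws-sdk v2 client bundled into Lambda; the bucket name, key, source URL and part size are placeholders, and it buffers parts with a plain loop instead of the Transform/PassThrough streams described above:

```javascript
// Minimal sketch of the multipart flow: download -> gzip -> S3, one part at a time.
const AWS = require('aws-sdk');
const https = require('https');
const zlib = require('zlib');

const s3 = new AWS.S3();
const Bucket = 'my-example-bucket';                   // placeholder - change this
const Key = 'big-file.txt.gz';                        // placeholder
const SOURCE_URL = 'https://example.com/large-file';  // placeholder source
const PART_SIZE = 5 * 1024 * 1024;                    // S3 minimum for all parts but the last

// wrap https.get so the response stream can be awaited
const download = (url) =>
  new Promise((resolve, reject) => https.get(url, resolve).on('error', reject));

exports.handler = async () => {
  // 1. initialize the upload and keep the UploadId
  const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();
  const parts = [];                                   // 4. collected { ETag, PartNumber }

  try {
    const response = await download(SOURCE_URL);
    const gzipped = response.pipe(zlib.createGzip()); // 2. transform the input stream

    let buffered = [];
    let bufferedBytes = 0;
    let partNumber = 1;

    // 3. push one buffered chunk to S3 and remember its ETag/PartNumber
    const flush = async (chunks) => {
      const { ETag } = await s3.uploadPart({
        Bucket, Key, UploadId,
        PartNumber: partNumber,
        Body: Buffer.concat(chunks),
      }).promise();
      parts.push({ ETag, PartNumber: partNumber });
      partNumber += 1;
    };

    // consume the gzip stream; awaiting inside the loop applies backpressure upstream
    for await (const chunk of gzipped) {
      buffered.push(chunk);
      bufferedBytes += chunk.length;
      if (bufferedBytes >= PART_SIZE) {
        await flush(buffered);
        buffered = [];
        bufferedBytes = 0;
      }
    }
    if (bufferedBytes > 0) await flush(buffered);     // last (possibly smaller) part

    // 5. assemble the final object from the tracked parts
    await s3.completeMultipartUpload({
      Bucket, Key, UploadId,
      MultipartUpload: { Parts: parts },
    }).promise();
  } catch (err) {
    // clean up orphaned parts if anything goes wrong
    await s3.abortMultipartUpload({ Bucket, Key, UploadId }).promise();
    throw err;
  }
};
```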

NOTE: This is just a proof of concept that I put together for demonstration purposes. There is no retry logic for failed chunk uploads, and error handling is almost non-existent, which can literally cost you (e.g. abortMultipartUpload should be called when the whole process is cancelled, to clean up the uploaded chunks, since they remain stored, but invisible, on S3 even though the final file was never assembled). The input stream is simply paused instead of queuing upload jobs and making use of the stream backpressure mechanisms, etc.
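If you do end up with orphaned chunks, a small cleanup sketch along these lines can find and abort them (again, the bucket name is a placeholder, and listMultipartUploads only returns up to 1000 uploads per call):

```javascript
// One-off cleanup sketch: abort multipart uploads that were never completed,
// since their parts keep occupying (billed) storage while staying invisible in the bucket.
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const Bucket = 'my-example-bucket';   // placeholder

s3.listMultipartUploads({ Bucket }).promise()
  .then(({ Uploads = [] }) =>
    Promise.all(Uploads.map(({ Key, UploadId }) =>
      s3.abortMultipartUpload({ Bucket, Key, UploadId }).promise())))
  .then((aborted) => console.log(`aborted ${aborted.length} incomplete uploads`))
  .catch(console.error);
```

(An S3 lifecycle rule for incomplete multipart uploads achieves the same thing without any code.)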

Upvotes: 7
