Reputation: 34900
My goal is to create a large gzipped text file and put it into S3.
The file contents consist of blocks which I read in a loop from another source.
Because of the size of this file, I cannot hold all the data in memory, so I need to somehow stream it directly to S3 and gzip it at the same time.
I understand how to perform this trick with the regular fs module in Node.js, but I am confused about whether it is possible to do the same with S3 from AWS Lambda. I know that s3.putObject can consume a stream object, but it seems to me that the stream has to be finalized before I perform the putObject operation, which could exceed the allowed memory.
Upvotes: 5
Views: 3408
Reputation: 1890
You can stream files (> 5 MB) into S3 buckets in chunks using the multipart upload functions in the Node.js aws-sdk.
This is not only useful for streaming large files into buckets, but also enables you to retry failed chunks (instead of a whole file) and parallelize the upload of individual chunks (with multiple upload Lambdas, which could be useful in a serverless ETL setup, for example). The order in which the chunks arrive is not important as long as you track them and finalize the process once all of them have been uploaded.
To use the multipart upload, you should:

1. call createMultipartUpload and store the returned UploadId (you'll need it for the chunk uploads)
2. as chunks come in, use uploadPart to push them to S3 (under the UploadId returned in step 1)
3. keep track of the ETags and PartNumbers from the chunk uploads
4. use the ETags and PartNumbers to assemble/finalize the file on S3 using completeMultipartUpload
Here's the gist of it in a working code example which streams a file from iso.org, pipes it through gzip and into an S3 bucket. Don't forget to change the bucket name and make sure to run the Lambda with 512 MB of memory on Node 6.10. You can use the code directly in the web GUI since there are no external dependencies.
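A minimal sketch of that flow, assuming aws-sdk v2 in a Node 6.10 Lambda; the source URL, bucket name and key below are placeholders to replace, and the part size is the 5 MB minimum S3 accepts for every part except the last:

```
const https = require('https');
const zlib = require('zlib');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const BUCKET = 'your-bucket-name';          // placeholder: change this
const KEY = 'big-file.txt.gz';              // placeholder object key
const SOURCE_URL = 'https://www.iso.org/';  // placeholder: any sizeable source file
const PART_SIZE = 5 * 1024 * 1024;          // S3 minimum for every part except the last

exports.handler = (event, context, callback) => {
  s3.createMultipartUpload({ Bucket: BUCKET, Key: KEY }).promise()
    .then(({ UploadId }) => new Promise((resolve, reject) => {
      const gzip = zlib.createGzip();
      const parts = [];                     // { ETag, PartNumber } needed to finalize
      let buffer = Buffer.alloc(0);
      let partNumber = 1;
      let pending = Promise.resolve();

      const uploadPart = (body) => {
        const thisPart = partNumber++;
        return s3.uploadPart({
          Bucket: BUCKET,
          Key: KEY,
          UploadId,
          PartNumber: thisPart,
          Body: body,
        }).promise().then(({ ETag }) => parts.push({ ETag, PartNumber: thisPart }));
      };

      gzip.on('data', (chunk) => {
        buffer = Buffer.concat([buffer, chunk]);
        if (buffer.length >= PART_SIZE) {
          const body = buffer;
          buffer = Buffer.alloc(0);
          gzip.pause();                     // crude backpressure: pause while this part uploads
          pending = pending
            .then(() => uploadPart(body))
            .then(() => gzip.resume())
            .catch(reject);
        }
      });

      gzip.on('end', () => {
        pending = pending
          .then(() => (buffer.length ? uploadPart(buffer) : null))   // last part may be < 5 MB
          .then(() => s3.completeMultipartUpload({
            Bucket: BUCKET,
            Key: KEY,
            UploadId,
            MultipartUpload: { Parts: parts },
          }).promise())
          .then(resolve)
          .catch(reject);
      });

      gzip.on('error', reject);

      // fetch the source and pipe it through gzip; the gzipped output is chunked above
      https.get(SOURCE_URL, (res) => res.pipe(gzip)).on('error', reject);
    }))
    .then(() => callback(null, 'upload complete'))
    .catch(callback);
};
```

Parts are uploaded one at a time and the gzip stream is paused while a part is in flight, which is the crude flow control mentioned in the note below.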
NOTE: This is just a proof of concept that I put together for demonstration purposes. There is no retry logic for failed chunk uploads, and error handling is almost non-existent, which can literally cost you (e.g. abortMultipartUpload should be called when cancelling the whole process, to clean up the uploaded chunks, since they remain stored and invisible on S3 even though the final file was never assembled). The input stream is simply paused instead of queuing upload jobs and using proper backpressure mechanisms, etc.
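If you do cancel mid-way, the cleanup call is roughly the following sketch; bucket, key and uploadId are whatever you stored when calling createMultipartUpload:

```
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Abort an unfinished multipart upload so the already-uploaded parts
// don't keep sitting (and billing) invisibly in the bucket.
function abortUpload(bucket, key, uploadId) {
  return s3.abortMultipartUpload({ Bucket: bucket, Key: key, UploadId: uploadId }).promise();
}
```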
Upvotes: 7