Reputation: 111
I have multiple S3 files in a bucket.
Input S3 bucket:
File 1 - 2 GB data
File 2 - 500 MB data
File 3 - 1 GB data
File 4 - 2 GB data
and so on. Assume there are 50 such files. The data in all files has the same schema, say attribute1, attribute2.
I want to merge these files and write the output to a new bucket such that each output file is less than 1 GB and keeps the same schema as before.
File 1 - < 1 GB
File 2 - < 1 GB
File 3 - < 1 GB
I am looking for AWS-based solutions that I can deliver using AWS CDK. I was considering the following two solutions:
Expected scale -> overall input size across all files: 1 TB
What would be a good way to go about implementing this? I hope I have phrased the question right; I'd be happy to clarify in the comments if there are any doubts.
Thanks!
Edit: Based on a comment -> Apologies for calling it a merge; it is more of a reset. All files have the same schema and are stored as CSV files. In pseudocode:
// Read all input files and concatenate them into one temporary file
List<File> listOfFiles = readFromS3(key);
File temp = new File("temp.csv");
for (File file : listOfFiles) {
    appendTo(temp, file);
}
// Break the concatenated file into chunks of at most 1 GB each
List<File> finalList = splitIntoChunks(temp, ONE_GB);
// Write each chunk to the output bucket
for (File file : finalList) {
    writeToS3(file);
}
Upvotes: 0
Views: 1737
Reputation: 5005
If your process includes a post-processing ETL (Extract, Transform, Load) step, you could use AWS Glue. Please find here an example of Glue using S3 as a source. If you'd like to use it with the Java SDK, the best starting points are:
Out of all of them, the tutorial to create a crawler (which you can find on GitHub via the URL above) should match your case best, as it crawls an S3 bucket and puts it into a Glue catalog for transformation.
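Since you mentioned delivering this with AWS CDK: a minimal sketch of the crawler piece in CDK for Java might look like the following. The bucket path, database name, and role setup are assumptions to illustrate the wiring, not values from your setup; the linked tutorial remains the authoritative walkthrough.

```java
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.glue.CfnCrawler;
import software.amazon.awscdk.services.iam.ManagedPolicy;
import software.amazon.awscdk.services.iam.Role;
import software.amazon.awscdk.services.iam.ServicePrincipal;
import software.constructs.Construct;
import java.util.List;

public class CsvCrawlerStack extends Stack {
    public CsvCrawlerStack(final Construct scope, final String id) {
        super(scope, id);

        // Role the crawler assumes; you still need to grant it read access
        // to the input bucket (policy names here are the standard Glue service role)
        Role crawlerRole = Role.Builder.create(this, "CrawlerRole")
                .assumedBy(new ServicePrincipal("glue.amazonaws.com"))
                .managedPolicies(List.of(
                        ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")))
                .build();

        // Crawler that scans the input bucket and registers the CSV schema
        // in the Glue Data Catalog (bucket name and database name are placeholders)
        CfnCrawler.Builder.create(this, "InputCsvCrawler")
                .role(crawlerRole.getRoleArn())
                .databaseName("csv_merge_db")
                .targets(CfnCrawler.TargetsProperty.builder()
                        .s3Targets(List.of(CfnCrawler.S3TargetProperty.builder()
                                .path("s3://my-input-bucket/")
                                .build()))
                        .build())
                .build();
    }
}
```

A Glue ETL job (or Athena, as in the other answer) would then read the catalog table produced by this crawler and write the repartitioned output.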
Upvotes: 0
Reputation: 269340
Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).
It can store the result in a new external table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
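As a rough sketch of that approach (table names, the output location, the database, and the bucket count are assumptions, not values from the question), the CTAS query could be submitted from Java with the AWS SDK v2 Athena client:

```java
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class CtasRepartition {
    public static void main(String[] args) {
        // CTAS query: reads the source table registered over the input CSVs and
        // writes bucketed CSV output; bucket_count controls how many files are produced.
        // With ~1 TB of input, a bucket_count around 1200 aims at < 1 GB per file
        // (an assumption - tune it for your data and any skew in the bucketing column).
        String ctas = String.join("\n",
                "CREATE TABLE merged_output",
                "WITH (",
                "  format = 'TEXTFILE',",
                "  field_delimiter = ',',",
                "  external_location = 's3://my-output-bucket/merged/',",
                "  bucketed_by = ARRAY['attribute1'],",
                "  bucket_count = 1200",
                ") AS",
                "SELECT * FROM source_csv_table;");

        try (AthenaClient athena = AthenaClient.create()) {
            athena.startQueryExecution(StartQueryExecutionRequest.builder()
                    .queryString(ctas)
                    .queryExecutionContext(QueryExecutionContext.builder()
                            .database("csv_merge_db")            // assumed catalog database
                            .build())
                    .resultConfiguration(ResultConfiguration.builder()
                            .outputLocation("s3://my-athena-results/")  // query metadata location
                            .build())
                    .build());
        }
    }
}
```

Note that bucketing distributes rows by the hash of the bucketed_by column, so individual file sizes are approximate rather than an exact cap.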
See:
Upvotes: 3