Reputation: 111
I have multiple S3 files in a bucket.
Input S3 bucket:
File 1 - 2 GB data
File 2 - 500 MB data
File 3 - 1 GB data
File 4 - 2 GB data
and so on. Assume there are 50 such files. The data in all files has the same schema, say attribute1, attribute2.
I want to merge these files and write the output to a new bucket such that each output file is less than 1 GB and keeps the same schema as before.
File 1 - < 1 GB
File 2 - < 1 GB
File 3 - < 1 GB
I am looking for AWS-based solutions that I can deliver using AWS CDK. I was considering the following two solutions:
Expected scale -> overall input size across all files: 1 TB
What would be a good way to go about implementing this? I hope I have phrased the question right; I'd be happy to clarify in the comments if there are any doubts.
Thanks!
Edit: Based on a comment -> Apologies for calling it a merge; it is more of a reset. All files have the same schema and are stored as CSV files. In pseudocode:
// Read all input files and concatenate them into one temporary file
List<File> listOfFiles = readFromS3(key);
File temp = new File("temp.csv");
for (File file : listOfFiles) {
    appendTo(temp, file);
}
// Break the concatenated file into chunks of at most 1 GB each
List<File> finalList = splitIntoChunks(temp, ONE_GB);
// Write each chunk to the output bucket
for (File file : finalList) {
    writeToS3(file);
}
Upvotes: 0
Views: 1737
Reputation: 5005
If your process includes a post-processing ETL (Extract, Transform, Load) step, you could use AWS Glue. Please find here an example of Glue using S3 as a source. If you'd like to use it with the Java SDK, the best starting points are:
Out of all of them, the tutorial to create a crawler (which you can find on GitHub via the URL above) should match your case best, as it crawls an S3 bucket and puts it into a Glue catalog for transformation.
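Since you mentioned delivering this with AWS CDK: a minimal sketch of the crawler piece in CDK for Java might look like the following. The bucket path, database name, and role setup are assumptions to illustrate the wiring, not values from your setup; the linked tutorial remains the authoritative walkthrough.

```java
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.glue.CfnCrawler;
import software.amazon.awscdk.services.iam.ManagedPolicy;
import software.amazon.awscdk.services.iam.Role;
import software.amazon.awscdk.services.iam.ServicePrincipal;
import software.constructs.Construct;
import java.util.List;

public class CsvCrawlerStack extends Stack {
    public CsvCrawlerStack(final Construct scope, final String id) {
        super(scope, id);

        // Role the crawler assumes; you still need to grant it read access
        // to the input bucket (policy names here are the standard Glue service role)
        Role crawlerRole = Role.Builder.create(this, "CrawlerRole")
                .assumedBy(new ServicePrincipal("glue.amazonaws.com"))
                .managedPolicies(List.of(
                        ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")))
                .build();

        // Crawler that scans the input bucket and registers the CSV schema
        // in the Glue Data Catalog (bucket name and database name are placeholders)
        CfnCrawler.Builder.create(this, "InputCsvCrawler")
                .role(crawlerRole.getRoleArn())
                .databaseName("csv_merge_db")
                .targets(CfnCrawler.TargetsProperty.builder()
                        .s3Targets(List.of(CfnCrawler.S3TargetProperty.builder()
                                .path("s3://my-input-bucket/")
                                .build()))
                        .build())
                .build();
    }
}
```

A Glue ETL job (or Athena, as in the other answer) would then read the catalog table produced by this crawler and write the repartitioned output.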
Upvotes: 0
Reputation: 269340
Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).
It can store the result in a new external table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
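As a rough sketch of that approach (table names, the output location, the database, and the bucket count are assumptions, not values from the question), the CTAS query could be submitted from Java with the AWS SDK v2 Athena client:

```java
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class CtasRepartition {
    public static void main(String[] args) {
        // CTAS query: reads the source table registered over the input CSVs and
        // writes bucketed CSV output; bucket_count controls how many files are produced.
        // With ~1 TB of input, a bucket_count around 1200 aims at < 1 GB per file
        // (an assumption - tune it for your data and any skew in the bucketing column).
        String ctas = String.join("\n",
                "CREATE TABLE merged_output",
                "WITH (",
                "  format = 'TEXTFILE',",
                "  field_delimiter = ',',",
                "  external_location = 's3://my-output-bucket/merged/',",
                "  bucketed_by = ARRAY['attribute1'],",
                "  bucket_count = 1200",
                ") AS",
                "SELECT * FROM source_csv_table;");

        try (AthenaClient athena = AthenaClient.create()) {
            athena.startQueryExecution(StartQueryExecutionRequest.builder()
                    .queryString(ctas)
                    .queryExecutionContext(QueryExecutionContext.builder()
                            .database("csv_merge_db")            // assumed catalog database
                            .build())
                    .resultConfiguration(ResultConfiguration.builder()
                            .outputLocation("s3://my-athena-results/")  // query metadata location
                            .build())
                    .build());
        }
    }
}
```

Note that bucketing distributes rows by the hash of the bucketed_by column, so individual file sizes are approximate rather than an exact cap.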
See:
Upvotes: 3