Reputation: 21
I have 5 different processes running on different virtual machines (VMs) on EC2 creating 5 different files (f1.txt, f2.txt, f3.txt, f4.txt, f5.txt). These VMs are started at roughly the same time but will finish at different times.
~ Wait for these 5 files to be written out
~ Merge them and create a new file, e.g. f.txt = f1.txt + f2.txt + f3.txt + f4.txt + f5.txt
~ Questions:
# How can I determine when all 5 files are ready and written out?
# Can I use s3cat or some similar command-line tool to do that? Does s3cat have similar semantics to Unix cat, e.g.

    cat s3://mybucket/f1.txt > s3://mybucket/f.txt
    cat s3://mybucket/f2.txt >> s3://mybucket/f.txt
    cat s3://mybucket/f3.txt >> s3://mybucket/f.txt
    cat s3://mybucket/f4.txt >> s3://mybucket/f.txt
    cat s3://mybucket/f5.txt >> s3://mybucket/f.txt
The s3cat examples on GitHub didn't show this use case.
The generated output file (f.txt) is for use by a downstream process.
Upvotes: 2
Views: 2101
Reputation: 12945
I think you want to use multipart upload instead of uploading a bunch of files and catting them together.
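A minimal sketch of this idea in Python, using boto3 and the bucket/key names from the question: S3's UploadPartCopy lets you stitch existing objects into a new one server-side, without downloading anything. One caveat: every part except the last must be at least 5 MB, so this only works if f1.txt through f4.txt each meet that minimum.

    import boto3

    s3 = boto3.client('s3')
    bucket = 'mybucket'
    source_keys = ['f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', 'f5.txt']

    # Start a multipart upload for the merged object
    upload = s3.create_multipart_upload(Bucket=bucket, Key='f.txt')

    # Copy each source object in as one part, in order
    parts = []
    for i, key in enumerate(source_keys, start=1):
        resp = s3.upload_part_copy(
            Bucket=bucket, Key='f.txt',
            UploadId=upload['UploadId'],
            PartNumber=i,
            CopySource={'Bucket': bucket, 'Key': key},
        )
        parts.append({'PartNumber': i, 'ETag': resp['CopyPartResult']['ETag']})

    # Finalize; f.txt now contains f1..f5 concatenated
    s3.complete_multipart_upload(
        Bucket=bucket, Key='f.txt',
        UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts},
    )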
Upvotes: 0
Reputation: 2046
If you know the names of the keys you are using for the 5 files you are uploading, you can just poll for them. If you know Python, boto is a great module for interfacing with S3 and would make handling the above a cinch. Also, S3 guarantees that a file won't appear to other clients until it has been completely uploaded, so you don't have to worry about reading partial files.
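A minimal sketch of the polling approach, assuming boto3 (the successor to boto) and the bucket/key names from the question:

    import time
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client('s3')
    bucket = 'mybucket'
    expected = ['f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', 'f5.txt']

    def all_files_ready():
        # HEAD each key; a missing key raises ClientError (404)
        for key in expected:
            try:
                s3.head_object(Bucket=bucket, Key=key)
            except ClientError:
                return False
        return True

    while not all_files_ready():
        time.sleep(10)  # poll every 10 seconds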
Boto is also a good way to concatenate the output if you are already using it to check for the files.
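Along the same lines, a sketch of client-side concatenation with boto3: stream each source object into a local file, then upload the merged result (f.txt and mybucket are the names from the question):

    import boto3

    s3 = boto3.client('s3')
    bucket = 'mybucket'

    # Download each part in order and append it to a local merged file
    with open('f.txt', 'wb') as out:
        for key in ['f1.txt', 'f2.txt', 'f3.txt', 'f4.txt', 'f5.txt']:
            body = s3.get_object(Bucket=bucket, Key=key)['Body']
            for chunk in iter(lambda: body.read(1024 * 1024), b''):
                out.write(chunk)

    # Upload the merged file back to S3 for the downstream process
    s3.upload_file('f.txt', bucket, 'f.txt')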
Upvotes: 1