Reputation: 1925
How to copy a few terabytes of data from GCS to S3?
There's a nice "Transfer" feature in GCS that allows importing data from S3 to GCS. But how do you export in the other direction (besides moving the data-generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but the transfer is then limited by that single machine's network throughput. How can it be parallelized more easily?
I tried Dataflow (a.k.a. Apache Beam now); that would work fine because it's easy to parallelize across, say, a hundred nodes, but I don't see a simple "just copy it from here to there" function.
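For reference, the per-object copy step I have in mind would be something like the sketch below (untested; the bucket names and the manifest file are placeholders, and it assumes AWS credentials are available on the workers):

```python
# Sketch only: a Beam pipeline that copies each GCS object to S3 inside a
# DoFn, so the copies run in parallel across workers. Bucket names,
# credentials handling and the manifest file are placeholders.
import apache_beam as beam
import boto3
from apache_beam.io.gcp.gcsio import GcsIO

class CopyToS3(beam.DoFn):
    def setup(self):
        # One S3 client per worker; assumes AWS credentials are available
        # on the workers (e.g. via environment variables).
        self.s3 = boto3.client("s3")
        self.gcs = GcsIO()

    def process(self, gcs_path):
        # gcs_path looks like "gs://source-bucket/path/to/object"
        key = gcs_path.split("/", 3)[3]
        with self.gcs.open(gcs_path) as src:
            # Streams the object; multipart upload is handled by boto3.
            self.s3.upload_fileobj(src, "destination-s3-bucket", key)
        yield gcs_path

with beam.Pipeline() as p:
    (p
     | "ReadManifest" >> beam.io.ReadFromText("gs://source-bucket/manifest.txt")
     | "CopyToS3" >> beam.ParDo(CopyToS3()))
```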
UPDATE: Also, Beam seems to compute the list of source files on the local machine, in a single thread, before starting the pipeline. In my case that takes around 40 minutes. It would be nice to distribute that on the cloud as well.
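One idea would be to shard the listing by key prefix and let Beam's `fileio.MatchAll` expand each glob on the workers instead of on the launching machine. A sketch (the prefixes and bucket name are placeholders, and the partitioning assumes object names are spread across those prefixes):

```python
# Sketch only: push the listing onto the workers by sharding it into globs.
import apache_beam as beam
from apache_beam.io import fileio

# One glob per shard; any partitioning of the keyspace works.
prefixes = [f"gs://source-bucket/{c}*" for c in "0123456789abcdef"]

with beam.Pipeline() as p:
    (p
     | "Prefixes" >> beam.Create(prefixes)
     | "ListOnWorkers" >> fileio.MatchAll()   # expands each glob on a worker
     | "ToPaths" >> beam.Map(lambda meta: meta.path)
     # ...feed the paths into a copy step such as the CopyToS3 DoFn above.
     | "Debug" >> beam.Map(print))
```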
UPDATE 2: So far I'm inclined to use two scripts of my own that would:
The drawback is that this means writing code that may contain bugs, etc., rather than using a built-in solution like GCS "Transfer".
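As a rough, hypothetical illustration only (these are not the actual scripts), the listing/sharding side could look like this, with the bucket name and shard count as placeholders:

```python
# Hypothetical sketch: list the source bucket and split the object names
# into N manifest files, e.g. one per worker VM that will do the copying.
from google.cloud import storage

BUCKET = "my-gcs-bucket"   # placeholder
SHARDS = 100               # e.g. one shard per worker VM

client = storage.Client()
writers = [open(f"manifest-{i:03d}.txt", "w") for i in range(SHARDS)]
for n, blob in enumerate(client.list_blobs(BUCKET)):
    # Round-robin the object names across the shard files.
    writers[n % SHARDS].write(f"gs://{BUCKET}/{blob.name}\n")
for w in writers:
    w.close()
```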
Upvotes: 2
Views: 1418
Reputation: 12145
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine). Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.
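For example, after putting AWS credentials into the Boto configuration that gsutil reads, something like `gsutil -m cp -r gs://your-gcs-bucket/* s3://your-s3-bucket/` (bucket names are placeholders) should work from such an instance, and you could split the source prefixes across several instances if one machine's bandwidth isn't enough.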
Upvotes: 2