norbjd

Reputation: 11287

Split and copy a file from a bucket to another bucket, without downloading it locally

I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.

I'm looking for something with the same final behaviour as the following commands:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally

split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_

gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/

But because I'm executing these actions on a Compute Engine VM with limited disk space, downloading the huge file locally is not possible (and it seems like a waste of network bandwidth anyway).

The file in this example is split by number of lines (-l 1000000), but I will also accept answers where the split is done by number of bytes.

I took a look at the docs about streaming uploads and downloads using gsutil to do something like:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -l 1000000 - ...

But I can't figure out how to upload the split files directly to gs://$DST_BUCKET/ without creating them locally (temporarily creating only one shard at a time for the transfer is OK, though).
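One direction that looks promising is GNU split's --filter option, which pipes each chunk to a command instead of writing it to a file. A sketch (untested; assumes GNU coreutils split, since --filter is a GNU extension not present in BSD split, and reuses the names from above):

# Stream the object down, and stream each 1000000-line chunk back up.
# $FILE is set by split to each chunk's would-be filename
# (a_split_of_my_file_aa, a_split_of_my_file_ab, ...); the single quotes
# keep it from being expanded by the outer shell.
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - \
  | split -l 1000000 --filter='gsutil cp - gs://'"$DST_BUCKET"'/$FILE' - a_split_of_my_file_

With this, no chunk ever touches the local disk; the data only passes through the pipe, one shard at a time.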

Upvotes: 4

Views: 3670

Answers (1)

Mike Schwartz

Reputation: 12155

This can't be done without downloading the data, but you could use range reads to build the pieces without ever holding the full file at once, e.g.:

gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
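To cover the whole object, a small loop can generate the ranges. A sketch (the chunk size and the fileN naming are illustrative; note this splits by bytes, which the question allows):

# Illustrative: copy the object piece by piece using inclusive byte
# ranges. Assumes gsutil and a POSIX shell; CHUNK is a made-up value.
CHUNK=$((100 * 1024 * 1024))   # 100 MiB per piece
SIZE=$(gsutil du gs://$SRC_BUCKET/$MY_HUGE_FILE | awk '{print $1}')
i=1
start=0
while [ "$start" -lt "$SIZE" ]; do
  end=$((start + CHUNK - 1))
  [ "$end" -ge "$SIZE" ] && end=$((SIZE - 1))   # clamp the last range
  gsutil cat -r "$start-$end" gs://$SRC_BUCKET/$MY_HUGE_FILE \
    | gsutil cp - gs://$DST_BUCKET/file$i
  start=$((end + 1))
  i=$((i + 1))
done

Keep in mind that fixed byte ranges will usually cut lines in the middle, so this can't reproduce split -l exactly.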

Upvotes: 4
