user728650

Reputation: 1986

Getting status of gsutil cp command in parallel mode

This command copies a huge number of files from Google Cloud Storage to my local server.

gsutil -m cp -r gs://my-bucket/files/ .

There are 200+ files, each of which is over 5GB in size.

Once all the files are downloaded, another process kicks in and starts reading them one by one, extracting the info needed.

The problem is that even though the gsutil copy is fast, downloading several files at a time at very high speed, I still need to wait until all the files are downloaded before I can start processing them.

Ideally I would like to start processing the first file as soon as it is downloaded. But in parallel copy mode, there seems to be no way of knowing when an individual file has finished downloading (or is there?).

From the Google docs, this can be done when copying files individually:

if ! gsutil cp ./local-file gs://your-bucket/your-object; then
  << Code that handles failures >>
fi

That means if I run cp without the -m flag, I get an exit status for each file and can kick off processing as soon as that file succeeds.

The problem with this approach is that the overall download will take much longer, since files are now downloaded one at a time, as in the sketch below.
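
Something like this, where process_file is a hypothetical handler for one downloaded file:

mkdir -p ./files
# Note: word-splitting the listing assumes object names contain no spaces.
for object in $(gsutil ls gs://my-bucket/files/); do
  if gsutil cp "$object" ./files/; then
    process_file "./files/$(basename "$object")"
  else
    echo "download failed: $object" >&2
  fi
done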

Any insight?

Upvotes: 0

Views: 4078

Answers (1)

Mike Schwartz

Reputation: 12145

One thing you could do is have a separate process that periodically lists the directory, filters out files that are incompletely downloaded (gsutil downloads each file to a name ending in '.gstmp' and renames it when the download completes), and keeps track of which files you haven't yet processed. You could terminate that watcher when the gsutil cp process completes, or just leave it running so it processes downloads the next time you run the copy.
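
A minimal sketch of such a watcher (bash 4+ for the associative array), assuming the downloads land in ./files/ and a hypothetical process_file command handles each completed file:

declare -A processed
while true; do
  for f in ./files/*; do
    [[ -f "$f" ]] || continue          # nothing downloaded yet
    [[ "$f" == *.gstmp ]] && continue  # skip in-progress downloads
    if [[ -z "${processed[$f]}" ]]; then
      process_file "$f"
      processed["$f"]=1
    fi
  done
  sleep 5
done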

Two potential complications with doing that are:

  1. If the number of files being downloaded is very large, the periodic directory listings could be slow. How big "very large" is depends on the type of file system you're using; you could experiment by creating a directory with the approximate number of files you expect to download and seeing how long it takes to list. Another option would be to use the gsutil cp -L option, which builds a manifest showing which files have been downloaded. You could then have a loop that reads through the manifest, looking for files that have downloaded successfully (see the first sketch after this list).
  2. If the multi-file download fails partway through (e.g., due to a network connection that drops for longer than gsutil will retry), you'll end up with a partial set of files. For that case you might consider using gsutil rsync, which can be restarted and will pick up where the previous run left off (also sketched below).
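
A sketch of the manifest approach: with -L, gsutil appends one CSV row per file as it finishes. The exact column layout assumed here (source, destination, then a Result field reading OK on success) is an assumption, so check the header row of your own manifest; process_file is again a hypothetical handler.

gsutil -m cp -r -L manifest.log gs://my-bucket/files/ . &
cp_pid=$!

declare -A seen
process_manifest() {
  # Skip the CSV header; rows whose Result field says OK are complete.
  while IFS=, read -r src dst rest; do
    dst="${dst#file://}"  # the manifest may record destinations as file:// URLs
    if [[ "$rest" == *OK* && -z "${seen[$dst]}" ]]; then
      seen["$dst"]=1
      process_file "$dst"
    fi
  done < <(tail -n +2 manifest.log 2>/dev/null)
}

while kill -0 "$cp_pid" 2>/dev/null; do
  process_manifest
  sleep 5
done
process_manifest  # one final pass after gsutil exits

And for the restart case, rsync only copies what is missing or changed, so re-running the same command after a failure resumes roughly where the previous attempt stopped:

gsutil -m rsync -r gs://my-bucket/files/ ./files/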

Upvotes: 2
