Reputation: 3657
What I need to achieve is to concatenate a list of files into a single file, using the cloudstorage
library. This needs to happen inside a mapreduce shard, which has a 512MB upper limit on memory, but the concatenated file could be larger than 512MB.
The following code segment breaks when the file size hits the memory limit.
list_of_files = [...]
with cloudstorage.open(filename...) as file_handler:
    for a in list_of_files:
        with cloudstorage.open(a) as f:
            file_handler.write(f.read())
Is there a way to work around this issue? Maybe open or append the files in chunks? And how would I do that? Thanks!
== EDIT ==
After some more testing, it seems that the memory limit only applies to f.read()
, while writing to a large file is okay. Reading the files in chunks solved my issue (rough sketch below), but I really like the compose()
function that @Ian-Lewis pointed out. Thanks!
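In case it's useful to anyone else, this is roughly what the chunked copy looks like. The file names and the chunk size here are just placeholders; pick a chunk size that fits your memory budget.

import cloudstorage

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per read; an arbitrary example value

def concat_files(list_of_files, output_filename):
    # output_filename and the entries of list_of_files are the full
    # "/bucket/path" style names that cloudstorage.open() expects.
    with cloudstorage.open(output_filename, 'w') as file_handler:
        for name in list_of_files:
            with cloudstorage.open(name) as f:
                while True:
                    chunk = f.read(CHUNK_SIZE)  # read a bounded amount at a time
                    if not chunk:
                        break
                    file_handler.write(chunk)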
Upvotes: 2
Views: 2207
Reputation: 925
For a large file you will want to break it up into smaller files, upload each of those, and then merge them together as a composite object. You will want to use the compose()
function from the library. It seems there are no docs on it yet.
After you've uploaded all the parts, something like the following should work. One thing to make sure of is that the paths of the files to be composed don't contain the bucket name or a leading slash.
stat = cloudstorage.compose(
    [
        "path/to/part1",
        "path/to/part2",
        "path/to/part3",
        # ...
    ],
    "/my_bucket/path/to/output"
)
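If you don't need the intermediate parts once the composite object exists, you can delete them afterwards. A rough sketch, reusing the example paths above (note that, unlike the source list passed to compose(), delete() takes the full "/bucket/object" form of the name):

# Remove the source parts after the compose call above has succeeded.
for part in ["path/to/part1", "path/to/part2", "path/to/part3"]:
    cloudstorage.delete("/my_bucket/" + part)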
You may also want to check out the gsutil tool if possible. It can do automatic splitting, parallel uploading, and compositing of large files for you.
Upvotes: 2