Gzip a file in Python before uploading to Cloud Storage

Question

I have the following Python function to write the given content to a bucket in Cloud Storage:

import gzip
from google.cloud import storage

def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""

    # The contents to upload as a JSON string.
    contents = json

    storage_client = storage.Client()

    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"

    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))

    # Bucket
    my_bucket = storage_client.bucket('my_bucket')

    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'

    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')

However, I receive an error when running the function:

FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'

Highlighting this line as the cause:

with gzip.open(destination, "wb") as f:

I can confirm that the bucket and path both exist although the file itself is new and to be written.

I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.

How can I gzip a new file and upload to Cloud Storage?


ianyoung · Accepted Answer

Although @David's answer wasn't complete at the time of solving my problem, it got me on the right track. Here's what I ended up using along with explanations I found out along the way.

import gzip

from google.cloud import storage
from google.cloud.storage import fileio 

def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""

    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')

    # Filename (include path)
    blob = bucket.blob('path/to/file.json')

    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'

    writer = fileio.BlobWriter(blob)

    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")

    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))

    # Close connections
    gz.close()
    writer.close()

We use the GzipFile class instead of the gzip.compress() convenience function so that we can pass in the writer as the file object and set the mode. GzipFile works in binary mode only, so writing a str to it produces the error:

TypeError: memoryview: a bytes-like object is required, not 'str'

So we must write in binary mode (wb) and pass bytes to the .write() method, which means encoding our JSON string first. This can be done with str.encode() using utf-8. Failing to encode the string results in the same error.
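
The bytes requirement can be checked locally without touching Cloud Storage. Here is a minimal sketch, assuming an in-memory io.BytesIO buffer stands in for the BlobWriter; the helper name gzip_json_string is made up for illustration and is not part of any library:

```python
import gzip
import io


def gzip_json_string(json_string):
    """Return the gzip-compressed bytes of a JSON string (hypothetical helper)."""
    buffer = io.BytesIO()  # assumption: stands in for fileio.BlobWriter(blob)

    # GzipFile only accepts bytes, so the string is encoded before writing.
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        gz.write(json_string.encode("utf-8"))

    # Leaving the with-block flushes the gzip trailer into the buffer.
    return buffer.getvalue()


compressed = gzip_json_string('{"name": "example"}')

# Decompressing and decoding should give back the original string.
restored = gzip.decompress(compressed).decode("utf-8")
```

Using GzipFile as a context manager here replaces the explicit gz.close() call; the underlying buffer (or writer) still has to be closed separately when it is a real BlobWriter.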

Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when it is requested. To enable this, google.cloud.storage.blob allows you to set metadata including content_type and content_encoding, so we can follow best practices.

The result is that the JSON string in memory is written to your chosen destination in Cloud Storage in a compressed format and is decompressed on the fly when requested (without needing to download a gzip archive).
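
If the whole payload already fits in memory, much the same result can probably be reached without a streaming writer, by compressing first and uploading the compressed bytes. This is a sketch only, assuming blob is a google.cloud.storage Blob created as in the function above; the helper name upload_gzipped_json is made up for illustration:

```python
import gzip


def upload_gzipped_json(blob, json_string):
    """Compress a JSON string in memory and upload it through `blob`.

    Assumption: `blob` is a google.cloud.storage Blob, for example
    storage.Client().bucket('my_bucket').blob('path/to/file.json').
    """
    # Encode to bytes first, then compress in one step.
    compressed = gzip.compress(json_string.encode("utf-8"))

    # Metadata for decompressive transcoding.
    blob.content_encoding = "gzip"

    # Upload the compressed bytes, not the original string.
    blob.upload_from_string(compressed, content_type="application/json")

    return len(compressed)
```

The point to note is that the compressed bytes are what gets uploaded. Setting content_encoding to gzip while uploading the uncompressed string would label the object as gzip without it actually being compressed.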

Thanks also to @JohnHanley for the troubleshooting advice.
