ldacey

Reputation: 578

Compressing a stream from Azure Blob (Python SDK)

Can I compress data from Azure Blob to gzip as I download it? I would like to avoid having all data in memory if possible.

I tried two different approaches (the compress_chunk and compress_blob functions below). I am not sure whether the entire blob was in memory before compression, though, or whether I can compress it as it is read in somehow.

import gzip
import io

def compress_chunk(data):
    # Rewind the source stream and gzip it into an in-memory buffer,
    # 4 MiB at a time.
    data.seek(0)
    compressed_body = io.BytesIO()
    with gzip.open(compressed_body, mode='wb') as compressor:
        while True:
            chunk = data.read(1024 * 1024 * 4)
            if not chunk:
                break
            compressor.write(chunk)
    compressed_body.seek(0)
    return compressed_body

def compress_blob(data):
    # Compress the buffer in one shot; the raw bytes and the compressed
    # bytes are both held in memory at the same time.
    return gzip.compress(data.getvalue())

def process_download(container_name, blob):
    # Download the whole blob into an in-memory stream, then compress it.
    with io.BytesIO() as input_io:
        blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob.name, stream=input_io)
        return compress_chunk(data=input_io)

Upvotes: 0

Views: 821

Answers (1)

suziki

Reputation: 14113

I think you already know how to compress data, so the following is just to clarify a few points.

I am not sure if the entire blob was in memory though before compression.

When we need to process blob data, we use the official SDK method to download the blob. At that point the data arrives as a stream: it is not written to disk, but it does occupy memory allocated by the program.
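With the v12 azure-storage-blob SDK (the BlobClient linked below), that download looks roughly like this; the connection string, container, and blob names are placeholders:

import io

from azure.storage.blob import BlobClient

# Placeholder connection details.
blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="<container>",
    blob_name="<blob>",
)

# download_blob() returns a StorageStreamDownloader; readinto() writes
# the blob's bytes into the stream we supply, here an in-memory buffer.
buffer = io.BytesIO()
blob_client.download_blob().readinto(buffer)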

Azure does not provide a method to compress the data on the service side before you download it:

https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#methods

Therefore, when we want to process the data we must download it first, and a downloaded stream will of course take up memory.
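That said, you can avoid holding the entire blob in memory at once. Here is a minimal sketch, assuming the v12 SDK, where the StorageStreamDownloader returned by download_blob() exposes a chunks() iterator: each chunk is compressed as it arrives, so only the current chunk and the compressed output are resident at any time.

import gzip
import io

from azure.storage.blob import BlobClient

def compress_blob_streaming(blob_client: BlobClient) -> io.BytesIO:
    # Stream the download and compress chunk by chunk instead of
    # buffering the whole blob first.
    compressed_body = io.BytesIO()
    downloader = blob_client.download_blob()
    with gzip.open(compressed_body, mode='wb') as compressor:
        for chunk in downloader.chunks():  # raw blob bytes, chunk by chunk
            compressor.write(chunk)
    compressed_body.seek(0)
    return compressed_body

The compressed output still accumulates in memory here; pointing gzip.open at a file on disk instead of a BytesIO would keep memory usage flat for large blobs.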

Upvotes: 1
