Reputation: 578
Can I compress data from Azure Blob to gzip as I download it? I would like to avoid having all data in memory if possible.
I tried two different approaches, the compress_chunk and compress_blob functions below. I am not sure, though, whether the entire blob was in memory before compression, or whether I can somehow compress it as it is read in.
import gzip
import io

def compress_chunk(data):
    # Re-compress an already-downloaded stream 4 MiB at a time.
    data.seek(0)
    compressed_body = io.BytesIO()
    compressor = gzip.open(compressed_body, mode='wb')
    while True:
        chunk = data.read(1024 * 1024 * 4)
        if not chunk:
            break
        compressor.write(chunk)
    compressor.flush()
    compressor.close()
    compressed_body.seek(0, 0)  # rewind so the caller can read the result
    return compressed_body
def compress_blob(data):
    # Compresses in one call, so the whole payload is in memory at once.
    compressed_body = gzip.compress(data.getvalue())
    return compressed_body
def process_download(container_name, blob):
    with io.BytesIO() as input_io:
        blob_service.get_blob_to_stream(container_name=container_name,
                                        blob_name=blob.name,
                                        stream=input_io)
        compressed_body = compress_chunk(data=input_io)
Upvotes: 0
Views: 821
Reputation: 14113
I think you already know how to compress data, so the following is just to clarify a few points.

"I am not sure, though, whether the entire blob was in memory before compression."

When we download blob data for processing, we use the official client method, and the data arrives as a stream. It is not written to disk, but it does occupy memory allocated to the program.
Azure does not provide a way to compress blob data on the server side before it is downloaded.

Therefore, when we want to process the data, we must download it first, and a stream held in memory will of course take up memory. What you can avoid is buffering the entire blob at once: compress each chunk as it arrives, as in the sketch below.
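One way to do that (a minimal sketch, untested, using the same legacy azure-storage-blob SDK as your code; the account, container, blob, and file names are placeholders) is to pass a gzip.GzipFile that writes to a file on disk as the stream argument of get_blob_to_stream. Each downloaded chunk is then compressed as it is written, so only one chunk is in memory at a time.

import gzip
from azure.storage.blob import BlockBlobService

# Placeholder credentials -- replace with your own.
blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

with open('myblob.gz', 'wb') as raw_file:
    with gzip.GzipFile(fileobj=raw_file, mode='wb') as gz:
        # max_connections=1 keeps the download sequential; a GzipFile is
        # not seekable, so parallel out-of-order writes would fail.
        blob_service.get_blob_to_stream(
            container_name='mycontainer',
            blob_name='myblob',
            stream=gz,
            max_connections=1,
        )

With this arrangement the compressed bytes go straight to disk, and memory use stays around one download chunk plus gzip's internal buffer rather than the whole blob.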
Upvotes: 1