Reputation: 30841
I want to read a huge Azure Blob Storage file and stream its content to Event Hubs. I found this example:
from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='', account_key='')

container_name = ""
blob_name_to_download = "test.txt"
file_path = "/home/Adam/Downloaded_test.txt"

bb.get_blob_to_path(container_name, blob_name_to_download, file_path, open_mode='wb',
                    snapshot=None, start_range=None, end_range=None, validate_content=False,
                    progress_callback=None, max_connections=2, lease_id=None,
                    if_modified_since=None, if_unmodified_since=None,
                    if_match=None, if_none_match=None, timeout=None)
But this downloads the whole blob to a file in one go; it doesn't let me fetch the content in chunks in a loop, which is what I want. So, how can I modify this code for my case?
Upvotes: 1
Views: 1448
Reputation: 30841
Here is the Python version of Gaurav's pseudocode. Note that I had to install the azure.storage.blob package with pip install azure-storage-blob==2.1.0.
from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='<storage_account_name>', account_key='<sas_key>')
container_name = "<container_name>"
blob_name = "<dir>/<file>"

# First get the blob's properties; we need the blob's content length.
blob = bb.get_blob_properties(container_name=container_name, blob_name=blob_name)

# Extract the content length from the blob's properties.
blob_size = blob.properties.content_length

# Now, say we want to fetch a 1 MB chunk at a time,
# so we loop and fetch 1 MB of content per iteration.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024  # 1 MB

while start < end:
    start_range = start
    end_range = start + chunk_size - 1
    blob_chunk_content = bb.get_blob_to_text(container_name, blob_name,
        encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range,
        validate_content=False, progress_callback=None, max_connections=2,
        lease_id=None, if_modified_since=None, if_unmodified_since=None,
        if_match=None, if_none_match=None, timeout=None)
    print(blob_chunk_content.content)
    start = end_range + 1
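Since the stated goal was to stream the content to Event Hubs, here is a minimal sketch of how each chunk could be forwarded, assuming the separate azure-eventhub (v5) package and placeholder connection details; treat it as an illustration of the pattern rather than a tested pipeline.

from azure.eventhub import EventHubProducerClient, EventData

# Placeholder values - substitute your own Event Hubs connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event_hubs_connection_string>",
    eventhub_name="<event_hub_name>")

def send_chunk_to_event_hub(chunk_text):
    # Wrap the blob chunk in an EventData message and send it as a batch.
    # Note: an Event Hubs batch has its own size limit, so a full 1 MB chunk
    # may need to be split further before batch.add() will accept it.
    batch = producer.create_batch()
    batch.add(EventData(chunk_text))
    producer.send_batch(batch)

In the loop above you would call send_chunk_to_event_hub(blob_chunk_content.content) in place of the print, and call producer.close() once the loop finishes.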
Upvotes: 1
Reputation: 136334
If you notice, there are two parameters in the get_blob_to_path method - start_range and end_range. These two parameters allow you to read the blob's data in chunks. What you need to do is get the blob's properties first to find its length, and then repeatedly call a get_blob_* method to fetch the data in chunks. I used the get_blob_to_text method, but you can see the other methods here.
Here's the pseudo code I came up with. HTH.
bb = BlockBlobService(account_name='', account_key='')

container_name = ""
blob_name_to_download = "test.txt"
file_path = "/home/Adam/Downloaded_test.txt"

# First get the blob's properties; we need the blob's content length.
blob = bb.get_blob_properties(container_name, blob_name_to_download)

# Extract the content length from the blob's properties.
blob_size = blob.properties.content_length

# Now, say we want to fetch a 1 MB chunk at a time,
# so we loop and fetch 1 MB of content per iteration.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024  # 1 MB

while start < end:
    start_range = start
    end_range = start + chunk_size - 1
    blob_chunk_content = bb.get_blob_to_text(container_name, blob_name_to_download,
        encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range,
        validate_content=False, progress_callback=None, max_connections=2,
        lease_id=None, if_modified_since=None, if_unmodified_since=None,
        if_match=None, if_none_match=None, timeout=None)
    # blob_chunk_content holds up to 1 MB of data. Do whatever you like with it.
    start = end_range + 1
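As an aside, azure-storage-blob 2.1.0 is the legacy SDK. In the current v12 SDK the same ranged read is expressed through the offset and length parameters of BlobClient.download_blob. A minimal sketch, assuming placeholder account details:

from azure.storage.blob import BlobClient

# Placeholder values - substitute your own connection string, container and blob.
blob_client = BlobClient.from_connection_string(
    "<connection_string>", container_name="<container_name>", blob_name="<dir>/<file>")

blob_size = blob_client.get_blob_properties().size
chunk_size = 1 * 1024 * 1024  # 1 MB

start = 0
while start < blob_size:
    # download_blob(offset, length) performs the same ranged GET that
    # start_range/end_range does in the legacy SDK.
    downloader = blob_client.download_blob(offset=start, length=min(chunk_size, blob_size - start))
    chunk = downloader.readall()  # bytes; decode or forward as needed
    start += chunk_size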
Upvotes: 2