Reputation: 30841
I want to read a huge Azure Blob Storage file and stream its content to Event Hubs. I found this example:
from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='', account_key='')

container_name = ""
blob_name_to_download = "test.txt"
file_path = "/home/Adam/Downloaded_test.txt"

bb.get_blob_to_path(container_name, blob_name_to_download, file_path, open_mode='wb',
                    snapshot=None, start_range=None, end_range=None, validate_content=False,
                    progress_callback=None, max_connections=2, lease_id=None,
                    if_modified_since=None, if_unmodified_since=None,
                    if_match=None, if_none_match=None, timeout=None)
But this downloads the whole blob to a file in one go; it doesn't let me fetch the content in chunks in a loop, which is what I want. So, how can I modify this code for my case?
Upvotes: 1
Views: 1448
Reputation: 30841
Here is the Python version of Gaurav's pseudocode. Note that I had to install the azure.storage.blob package with pip install azure-storage-blob==2.1.0.
from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='<storage_account_name>', account_key='<sas_key>')
container_name = "<container_name>"
blob_name = "<dir>/<file>"

# First get the blob's properties; we need the blob's content length.
blob = bb.get_blob_properties(container_name=container_name, blob_name=blob_name)

# Extract the content length from the blob's properties.
blob_size = blob.properties.content_length

# Now, say we want to fetch a 1 MB chunk at a time,
# so we loop and fetch 1 MB of content per iteration.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024  # 1 MB

while start < end:
    start_range = start
    end_range = start + chunk_size - 1
    blob_chunk_content = bb.get_blob_to_text(container_name, blob_name,
        encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range,
        validate_content=False, progress_callback=None, max_connections=2,
        lease_id=None, if_modified_since=None, if_unmodified_since=None,
        if_match=None, if_none_match=None, timeout=None)
    print(blob_chunk_content.content)
    start = end_range + 1
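Since the stated goal was to stream the content to Event Hubs, here is a minimal sketch of how each chunk could be forwarded, assuming the separate azure-eventhub (v5) package and placeholder connection details; treat it as an illustration of the pattern rather than a tested pipeline.

from azure.eventhub import EventHubProducerClient, EventData

# Placeholder values - substitute your own Event Hubs connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event_hubs_connection_string>",
    eventhub_name="<event_hub_name>")

def send_chunk_to_event_hub(chunk_text):
    # Wrap the blob chunk in an EventData message and send it as a batch.
    # Note: an Event Hubs batch has its own size limit, so a full 1 MB chunk
    # may need to be split further before batch.add() will accept it.
    batch = producer.create_batch()
    batch.add(EventData(chunk_text))
    producer.send_batch(batch)

In the loop above you would call send_chunk_to_event_hub(blob_chunk_content.content) in place of the print, and call producer.close() once the loop finishes.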
Upvotes: 1
Reputation: 136334
If you notice, there are two parameters in the get_blob_to_path method - start_range and end_range. These two parameters allow you to read the blob's data in chunks. What you need to do is get the blob's properties first to find its length, and then repeatedly call a get_blob_* method to fetch the data in chunks. I used the get_blob_to_text method, but you can see the other methods here.
Here's the pseudo code I came up with. HTH.
bb = BlockBlobService(account_name='', account_key='')

container_name = ""
blob_name_to_download = "test.txt"
file_path = "/home/Adam/Downloaded_test.txt"

# First get the blob's properties; we need the blob's content length.
blob = bb.get_blob_properties(container_name, blob_name_to_download)

# Extract the content length from the blob's properties.
blob_size = blob.properties.content_length

# Now, say we want to fetch a 1 MB chunk at a time,
# so we loop and fetch 1 MB of content per iteration.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024  # 1 MB

while start < end:
    start_range = start
    end_range = start + chunk_size - 1
    blob_chunk_content = bb.get_blob_to_text(container_name, blob_name_to_download,
        encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range,
        validate_content=False, progress_callback=None, max_connections=2,
        lease_id=None, if_modified_since=None, if_unmodified_since=None,
        if_match=None, if_none_match=None, timeout=None)
    # blob_chunk_content holds up to 1 MB of data. Do whatever you like with it.
    start = end_range + 1
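As an aside, azure-storage-blob 2.1.0 is the legacy SDK. In the current v12 SDK the same ranged read is expressed through the offset and length parameters of BlobClient.download_blob. A minimal sketch, assuming placeholder account details:

from azure.storage.blob import BlobClient

# Placeholder values - substitute your own connection string, container and blob.
blob_client = BlobClient.from_connection_string(
    "<connection_string>", container_name="<container_name>", blob_name="<dir>/<file>")

blob_size = blob_client.get_blob_properties().size
chunk_size = 1 * 1024 * 1024  # 1 MB

start = 0
while start < blob_size:
    # download_blob(offset, length) performs the same ranged GET that
    # start_range/end_range does in the legacy SDK.
    downloader = blob_client.download_blob(offset=start, length=min(chunk_size, blob_size - start))
    chunk = downloader.readall()  # bytes; decode or forward as needed
    start += chunk_size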
Upvotes: 2