Reputation: 725
I am working in containers against fairly large blobs. The data is log data and each line is several KB. I expected BlockBlobService.get_blob_to_stream to operate like a stream, but it downloads the whole blob. How can I actually stream it, along the lines of:
with some_method as blobStream:
    for line in blobStream:
        <do something with line>
For my needs I cannot download the entire blob or hold it in memory; I just need a line at a time.
Upvotes: 0
Views: 2398
Reputation: 725
I wound up setting the blob to public and doing the following, since it is text. Not super ideal, but it gets the job done.
import urllib.request

for line in urllib.request.urlopen(blob_url):
    <do something with line>
EDIT
One problem I had to contend with was socket loss because I was working against a live blob. I had to take a snapshot and use a session, or I was getting disconnects after about 5 seconds of processing.
import requests

# Snapshot the live blob so the read target is immutable.
snapshotBlob = append_blob_service.snapshot_blob(container, blobName)
_params = {
    'snapshot': snapshotBlob.snapshot,
    'timeout': 20000,
}

# A persistent session keeps the connection alive for the long read.
s = requests.Session()
r = s.get(target_url, params=_params, stream=True, timeout=20000)
for line in r.iter_lines():
    <do something with line>
Upvotes: 0
Reputation: 23792
You could use the Range or x-ms-range header in the Get Blob request to return only the bytes of the blob in the specified range. In the Python Storage SDK, these map to the start_range and end_range parameters.
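For instance, a minimal sketch assuming the legacy azure-storage SDK's BlockBlobService (the account, container, and blob names are placeholders):

from azure.storage.blob import BlockBlobService

# Placeholder account details -- substitute your own.
blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# Fetch only the first 4 MB; start_range/end_range are inclusive byte
# offsets and translate to the Range header on the Get Blob request.
chunk = blob_service.get_blob_to_bytes(
    'mycontainer', 'myblob.log',
    start_range=0, end_range=4 * 1024 * 1024 - 1)
print(len(chunk.content))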
For example, a 1 GB blob could be split into 100 ranged requests, all sent at once, and the responses written to a local file for subsequent processing. Just make sure each chunk is written at the correct offset. The drawback is that this requires about 1 GB of memory.
A more optimized approach is to process each downloaded portion as it arrives, downloading only as much at a time as fits your memory quota. For example, divide the blob into 100 ranged requests but send only 5 at a time, in 20 ordered rounds: each batch of 5 responses is written out while the next 5 requests are sent, so the system only allocates roughly the quota's worth of memory (see the sketch below).
Since network instability can interrupt a request and force that byte range to be requested again, I suggest dividing the file into more parts.
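A rough sketch of that batched approach, again assuming the legacy SDK (the names and the local file path are placeholders; get_blob_properties is used to find the blob size):

import concurrent.futures
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
container, blob_name = 'mycontainer', 'myblob.log'

size = blob_service.get_blob_properties(
    container, blob_name).properties.content_length
chunk_size = size // 100 + 1   # ~100 ranged requests in total
batch = 5                      # 5 in-flight requests per batch

def fetch(offset):
    # Download one byte range; end_range is an inclusive offset.
    end = min(offset + chunk_size, size) - 1
    return blob_service.get_blob_to_bytes(
        container, blob_name, start_range=offset, end_range=end).content

offsets = list(range(0, size, chunk_size))
with open('local.log', 'wb') as out:
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch) as pool:
        for i in range(0, len(offsets), batch):
            # Download one batch of 5 ranges in parallel, write it in
            # order, then start the next batch -- memory stays at
            # roughly batch * chunk_size.
            for data in pool.map(fetch, offsets[i:i + batch]):
                out.write(data)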
Hope this helps.
Upvotes: 0
Reputation: 136346
As such, no functionality exists that will enable you to read a blob line by line; you would need to come up with your own solution for that.
However, you can certainly read partial content of the blob using the get_blob_to_stream method. If you look at this method's signature:
def get_blob_to_stream(
        self, container_name, blob_name, stream, snapshot=None,
        start_range=None, end_range=None, validate_content=False,
        progress_callback=None, max_connections=2, lease_id=None,
        if_modified_since=None, if_unmodified_since=None, if_match=None,
        if_none_match=None, timeout=None):
You will notice that it has two parameters (start_range and end_range). These enable you to read partial blob content instead of the whole blob.
What you could do is read a chunk of data (say, 1 MB at a time) and then build some logic to break that data into lines. For example:
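Here is one possible sketch of that idea; the stream_blob_lines helper is hypothetical, assuming the same legacy SDK as the signature above:

from io import BytesIO

def stream_blob_lines(blob_service, container_name, blob_name,
                      chunk_size=1024 * 1024):
    # Hypothetical helper: pull the blob ~1 MB at a time via
    # start_range/end_range and yield one complete line at a time.
    size = blob_service.get_blob_properties(
        container_name, blob_name).properties.content_length
    offset = 0
    leftover = b''
    while offset < size:
        stream = BytesIO()
        blob_service.get_blob_to_stream(
            container_name, blob_name, stream,
            start_range=offset,
            end_range=min(offset + chunk_size, size) - 1)
        lines = (leftover + stream.getvalue()).split(b'\n')
        leftover = lines.pop()   # trailing piece may be a partial line
        for line in lines:
            yield line
        offset += chunk_size
    if leftover:
        yield leftover           # last line if the blob has no final newline

Usage would then mirror the loop from the question:

for line in stream_blob_lines(blob_service, 'mycontainer', 'myblob.log'):
    <do something with line>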
Upvotes: 3