pelos

Reputation: 1876

How to read a big CSV file in Azure Blob Storage

I will get a HUGE CSV file as a blob in Azure, and I need to parse it line by line in an Azure Function.

I am reading each of the blobs in my container and getting the content as a string, but I think that loads everything into memory before I split it by new lines. Is there a smarter way to do this?

from azure.storage.blob import BlockBlobService

account_name = "myaccount"  # storage account name (not the container name)
container_name = "test"
block_blob_service = BlockBlobService(account_name=account_name, account_key="mykey")
block_blob_service.get_container_properties(container_name)  # raises if the container doesn't exist
generator = block_blob_service.list_blobs(container_name)

for b in generator:
    # get_blob_to_text downloads the whole blob into memory as a single string
    r = block_blob_service.get_blob_to_text(container_name, b.name)
    for line in r.content.split("\n"):
        print(line)

Upvotes: 2

Views: 2964

Answers (2)

pelos

Reputation: 1876

After reading other websites and modifying some of the code from the link in the other answer, this is what I ended up with:

import io
import datetime
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB; don't make this too big or you will get a memory error
output = io.BytesIO()


def worker(data):
    # process one downloaded chunk here; replace print with real parsing
    print(data)


while index < blob_size:
    now_chunk = datetime.datetime.now()  # timestamp, e.g. to measure per-chunk timing
    # download the next byte range into the stream (end_range is inclusive)
    block_blob_service.get_blob_to_stream(container, blob, stream=output,
                                          start_range=index,
                                          end_range=index + chunk_size - 1,
                                          max_connections=50)
    if output is None:
        continue
    output.seek(index)
    data = output.read()
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:
            break
    else:
        break

Upvotes: 1

Murray Foxcroft

Reputation: 13745

I am not sure how huge your huge is, but for very large files (> 200 MB or so) I would use a streaming approach. The call get_blob_to_text downloads the entire file in one go and places it all in memory. Using get_blob_to_stream lets you download and process the blob in chunks, and therefore line by line, with only the current chunk and your working set in memory. This is very fast and very memory efficient. We use a similar approach to split 1 GB files into smaller files; 1 GB takes a couple of minutes to process.
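
A rough sketch of that approach with the same legacy BlockBlobService SDK used in the question (the account, container, blob name, chunk size and UTF-8 encoding below are assumptions, not values from the question): download the blob in fixed-size byte ranges, keep any trailing partial line in a buffer, and yield complete lines one at a time.

import io
from azure.storage.blob import BlockBlobService  # legacy azure-storage-blob 1.x SDK

ACC_NAME = "myaccount"        # placeholder storage account
ACC_KEY = "mykey"             # placeholder key
CONTAINER = "test"            # placeholder container
BLOB = "huge.csv"             # placeholder blob name
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per range request

service = BlockBlobService(account_name=ACC_NAME, account_key=ACC_KEY)
blob_size = int(service.get_blob_properties(CONTAINER, BLOB).properties.content_length)


def iter_lines():
    """Yield the blob's lines one at a time, downloading it in byte ranges."""
    leftover = b""
    start = 0
    while start < blob_size:
        end = min(start + CHUNK_SIZE, blob_size) - 1   # end_range is inclusive
        chunk = io.BytesIO()
        service.get_blob_to_stream(CONTAINER, BLOB, stream=chunk,
                                   start_range=start, end_range=end)
        data = leftover + chunk.getvalue()
        lines = data.split(b"\n")
        leftover = lines.pop()          # last piece may be an incomplete line
        for line in lines:
            yield line.decode("utf-8")  # assumes UTF-8 content
        start = end + 1
    if leftover:
        yield leftover.decode("utf-8")  # final line without a trailing newline


for line in iter_lines():
    print(line)  # replace with your own per-line processing

Only one chunk plus the current leftover line is ever held in memory, so the blob size doesn't matter; tune CHUNK_SIZE to trade request count against memory.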

Keep in mind that, depending on your Function App service plan, the maximum execution time is 5 minutes by default (you can increase this to 10 minutes in host.json). Also, on the Consumption plan you are limited to 1.5 GB of memory per function app (not per function; the limit applies to all functions running in that app). So be aware of these limits.
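
For reference, the timeout is controlled by the functionTimeout setting in host.json; a minimal example raising it to the 10-minute Consumption plan maximum:

{
  "functionTimeout": "00:10:00"
}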

From the docs:

get_blob_to_stream(container_name, blob_name, stream, snapshot=None, start_range=None, end_range=None, validate_content=False, progress_callback=None, max_connections=2, lease_id=None, if_modified_since=None, if_unmodified_since=None, if_match=None, if_none_match=None, timeout=None)

Here is a good read on the topic

Upvotes: 1
