Alexander Farber

Reputation: 23018

How to parse AVRO blobs captured in a storage account of Event Hub?

In Microsoft Azure we have an Event Hub capturing JSON data and storing it in AVRO format in a blob storage account:

(screenshot of the storage account)

I have written a Python script that lists the AVRO files captured by the Event Hub:

from io import BytesIO
from operator import attrgetter
from avro.datafile import DataFileReader
from avro.io import DatumReader
from azure.storage.blob import BlobServiceClient

conn_str = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net'
container_name = 'container1'

blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client(container_name)

blob_list = []
for blob in container_client.list_blobs():
    if blob.name.endswith('.avro'):
        blob_list.append(blob)

blob_list.sort(key=attrgetter('creation_time'), reverse=True)

This works well and I get a list of AVRO blobs, sorted by the creation time.

Now I am trying to add the final steps where I would download the blobs, parse the AVRO-formatted data and retrieve the JSON payload.

I try to download each blob in the list into an in-memory buffer and parse it:

for blob in blob_list:
    blob_client = container_client.get_blob_client(blob.name)
    downloader = blob_client.download_blob()
    stream = BytesIO()
    downloader.download_to_stream(stream) # also tried readinto(stream)

    reader = DataFileReader(stream, DatumReader())
    for event_data in reader:
        print(event_data)
    reader.close()

Unfortunately, the above Python code does not work: nothing is printed.

I have also seen that there is a StorageStreamDownloader.readall() method, but I am not sure how to apply it.

I am using Windows 10, Python 3.8.5 and avro 1.10.0 installed by pip.

Upvotes: 3

Views: 2057

Answers (1)

Ivan Glasenberg

Reputation: 30015

The readall() method can be used as follows:

with open("xxx", "wb+") as my_file:
    my_file.write(blob_client.download_blob().readall())  # Write the blob contents into the file.

For more details about reading captured Event Hubs data, see the official documentation: Create a Python script to read your Capture files.
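If you would rather stay in memory than write a file, a minimal sketch along these lines may help. It assumes that each Capture record carries the event payload as bytes in a field named Body (as in the Capture Avro schema) and that the payload is UTF-8 encoded JSON, as described in the question. The helper names decode_bodies and read_capture_blob are made up for this illustration, and the avro imports are done lazily so that the decoding helper can be tried without the library installed.

```python
import json
from io import BytesIO


def decode_bodies(records):
    """Yield the JSON payload found in the 'Body' field of each record.

    Assumption: 'Body' holds UTF-8 encoded JSON (bytes or str).
    """
    for record in records:
        body = record["Body"]
        if isinstance(body, (bytes, bytearray)):
            body = body.decode("utf-8")
        yield json.loads(body)


def read_capture_blob(blob_client):
    """Download one Capture blob into memory and return its JSON payloads.

    Hypothetical helper: 'blob_client' is expected to be an
    azure.storage.blob.BlobClient.
    """
    # Imported here so decode_bodies() stays usable without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    data = blob_client.download_blob().readall()  # whole blob as bytes
    # A BytesIO created from bytes starts at position 0.
    reader = DataFileReader(BytesIO(data), DatumReader())
    try:
        return list(decode_bodies(reader))
    finally:
        reader.close()
```

In the loop from the question this would be called as read_capture_blob(container_client.get_blob_client(blob.name)). As far as I know, Capture can also write AVRO files that contain no events for time windows without traffic, so an empty result for some blobs is not necessarily an error.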

Please let me know if you run into any further issues.

Upvotes: 1
