Reputation: 160
I have lots of Avro files in an Azure Data Lake Storage Gen2 account, sent by an Event Hub with Capture enabled. These Avro files contain data from different sensors and engines. The directory structure follows this path format (typical of Azure Blobs):
namespace/eventhub/partition/year/month/day/hour/minute/file.avro
I need to access some of these files in order to get data to 1) pre-process and 2) train or re-train a machine learning model. I'd like to know what procedure I could follow to download or mount just the files containing data from a particular engine and/or sensor, given that not every Avro file contains data from all of them. Let's assume I'm interested only in files containing data from:
Engine = engine_ID_4012
Sensor = sensor_engine_4012_ID_0114
I'm aware that Spark offers some advantages when working with Avro files, so I could consider carrying out this task using Databricks. Otherwise, the option is the Azure Machine Learning service, but maybe there are other possibilities, for instance a combination of the two. The goal is to speed up the data ingestion process, avoiding reading files that contain no needed data.
Thanks.
Upvotes: 0
Views: 447
Reputation: 501
Thanks for reaching out. In Azure Machine Learning, you can do the following (a combined sketch of the three steps is shown after the list):
1. Create a datastore to connect to your storage service (ADLS Gen2 in your case).
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data#azure-data-lake-storage-generation-2
2. Create a FileDataset from the ADLS Gen2 datastore pointing to your Avro files.
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets#create-a-filedataset
3. Learn how to download or mount those files on your compute in ML experiments:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets
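To make the steps concrete, here is a minimal end-to-end sketch using the azureml-core (v1) SDK that the linked docs describe. The datastore name, filesystem/container, storage account, service-principal credentials, glob pattern, and dataset name are all placeholders for illustration, not values from your setup:

    from azureml.core import Workspace, Datastore, Dataset

    ws = Workspace.from_config()  # assumes a config.json for your AML workspace

    # 1. Register an ADLS Gen2 datastore (service-principal auth).
    adls_datastore = Datastore.register_azure_data_lake_gen2(
        workspace=ws,
        datastore_name='engines_adlsgen2',   # placeholder name
        filesystem='my-filesystem',          # the Gen2 filesystem (container)
        account_name='mystorageaccount',     # placeholder account
        tenant_id='<tenant-id>',
        client_id='<client-id>',
        client_secret='<client-secret>')

    # 2. Create a FileDataset pointing at the captured Avro files.
    #    The glob mirrors the capture path format from the question:
    #    namespace/eventhub/partition/year/month/day/hour/minute/file.avro
    #    Narrowing the pattern (e.g. to one year/month) is what keeps you
    #    from ingesting files you don't need.
    paths = [(adls_datastore, 'namespace/eventhub/*/2020/01/*/*/*/*.avro')]
    avro_ds = Dataset.File.from_files(path=paths)
    avro_ds = avro_ds.register(workspace=ws, name='engine_avro_files')

    # 3. Download (or mount) the files on your experiment compute.
    local_files = avro_ds.download(target_path='./avro_data', overwrite=True)

    # Alternatively, on Linux compute you can mount instead of downloading:
    # mount_ctx = avro_ds.mount('/tmp/avro_data')
    # mount_ctx.start()
    # ...read the files under mount_ctx.mount_point...
    # mount_ctx.stop()

Note that path globbing only filters at the folder/file level. Since the engine and sensor IDs live inside the Avro payloads rather than in the path, you would still need to read the file contents (e.g. during pre-processing) to filter on engine_ID_4012 or sensor_engine_4012_ID_0114.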
Upvotes: 0