Reputation: 160
I have lots of Avro files in an Azure Data Lake Storage Gen2 account, sent by an Event Hub with Capture enabled. These Avro files contain data from different sensors and engines. The directory structure follows this path format (typical of Azure Blobs):
namespace/eventhub/partition/year/month/day/hour/minute/file.avro
I need to access some of these files in order to get data to 1) pre-process and 2) train or re-train a machine learning model. I'd like to know what procedure I could follow to download or mount just the files containing data from a particular engine and/or sensor, given that not every Avro file contains data from all of them. Let's assume I'm interested only in files containing data from:
Engine = engine_ID_4012
Sensor = sensor_engine_4012_ID_0114
I'm aware that Spark offers some advantages when working with Avro files, so I could consider carrying out this task using Databricks. Otherwise, the option is the Azure Machine Learning service, but maybe there are other possibilities, for instance a combination of the two. The goal is to speed up the data ingestion process, avoiding reading files that contain no needed data.
Thanks.
Upvotes: 0
Views: 447
Reputation: 501
Thanks for reaching out. In Azure Machine Learning, you can do the following (a combined sketch of the three steps is shown after the list):
1. Create a datastore to connect to your storage service (ADLS Gen2 in your case).
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data#azure-data-lake-storage-generation-2
2. Create a FileDataset from the ADLS Gen2 datastore pointing to your Avro files.
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets#create-a-filedataset
3. Learn how to download or mount those files on your compute in ML experiments:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets
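To make the steps concrete, here is a minimal end-to-end sketch using the azureml-core (v1) SDK that the linked docs describe. The datastore name, filesystem/container, storage account, service-principal credentials, glob pattern, and dataset name are all placeholders for illustration, not values from your setup:

    from azureml.core import Workspace, Datastore, Dataset

    ws = Workspace.from_config()  # assumes a config.json for your AML workspace

    # 1. Register an ADLS Gen2 datastore (service-principal auth).
    adls_datastore = Datastore.register_azure_data_lake_gen2(
        workspace=ws,
        datastore_name='engines_adlsgen2',   # placeholder name
        filesystem='my-filesystem',          # the Gen2 filesystem (container)
        account_name='mystorageaccount',     # placeholder account
        tenant_id='<tenant-id>',
        client_id='<client-id>',
        client_secret='<client-secret>')

    # 2. Create a FileDataset pointing at the captured Avro files.
    #    The glob mirrors the capture path format from the question:
    #    namespace/eventhub/partition/year/month/day/hour/minute/file.avro
    #    Narrowing the pattern (e.g. to one year/month) is what keeps you
    #    from ingesting files you don't need.
    paths = [(adls_datastore, 'namespace/eventhub/*/2020/01/*/*/*/*.avro')]
    avro_ds = Dataset.File.from_files(path=paths)
    avro_ds = avro_ds.register(workspace=ws, name='engine_avro_files')

    # 3. Download (or mount) the files on your experiment compute.
    local_files = avro_ds.download(target_path='./avro_data', overwrite=True)

    # Alternatively, on Linux compute you can mount instead of downloading:
    # mount_ctx = avro_ds.mount('/tmp/avro_data')
    # mount_ctx.start()
    # ...read the files under mount_ctx.mount_point...
    # mount_ctx.stop()

Note that path globbing only filters at the folder/file level. Since the engine and sensor IDs live inside the Avro payloads rather than in the path, you would still need to read the file contents (e.g. during pre-processing) to filter on engine_ID_4012 or sensor_engine_4012_ID_0114.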
Upvotes: 0