ZZZSharePoint

Reputation: 1351

Querying multiple files in multiple folders in Azure Storage account using Azure Databricks

I have an Azure Storage account where I store the log files coming from Azure Diagnostics. These log files land in multiple folders partitioned down to the hour and minute. For example, one of my file paths in Blob Storage looks like this:

resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/

I would like to know how to query multiple files from multiple folders at a time. For example, if I have to query data from Day 23 to Day 24, what is the best way to do it in Databricks? These folders contain JSON files with multiple lines of JSON. Thanks.

Upvotes: 1

Views: 569

Answers (1)

restlessmodem

Reputation: 448

If you want to read all available files, you can just use wildcards.

path = "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=*/m=*/d=*/h=*/m=*/*"
spark.read.option("header","true").format("csv").load(path)
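
Spark's file readers accept Hadoop-style glob patterns, so if you only need a day range you could also narrow the pattern itself instead of matching everything. A minimal sketch, assuming the same folder layout as above:

path = "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d={23,24}/h=*/m=*/*"
spark.read.option("header","true").format("csv").load(path)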

If you only want to read a specific set of files, it is best to generate a list of the paths you want to read and pass that list to the Spark read function.

pathList = [
  "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/",
  "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=01/"
]
spark.read.option("header","true").format("csv").load(pathList)

You could generate the pathList in this example programmatically, according to which files you want to process, e.g.

pathList = []
for i in range(24):
  newPath = f"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h={i:02d}/m=01/"
  pathList.append(newPath)

spark.read.option("header","true").format("csv").load(pathList)

This example would read every hour (0-23) from the date 2022-05-23 at minute 1.
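
Since your folders contain JSON files with multiple lines of JSON, you could combine the same idea with Spark's JSON reader to cover the Day 23 to Day 24 range. A sketch under that assumption (paths follow the layout shown above; spark.read.json expects one JSON object per line by default):

pathList = []
for day in [23, 24]:
  for hour in range(24):
    # zero-pad day and hour so they match folder names like d=23/h=05
    newPath = f"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d={day:02d}/h={hour:02d}/m=00/"
    pathList.append(newPath)

df = spark.read.json(pathList)

Note that the read will fail if one of the listed folders does not exist, so you may want to list the container first (for example with dbutils.fs.ls) and keep only the paths that are actually there.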

Upvotes: 1
