Reputation: 1351
I have an Azure Storage account where I am storing the log files coming from Azure Diagnostics. These log files are stored in multiple folders partitioned by year, month, day, hour, and minute. For example, one of my file paths in blob storage looks like this:
resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/
I would like to know how to query multiple files from multiple folders at a time. For example, if I have to query data from day 23 to day 24, what is the best way to do it in Databricks? These folders contain JSON files with multiple lines of JSON. Thanks.
Upvotes: 1
Views: 569
Reputation: 448
If you want to read all available files, you can just use wildcards:
path = "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=*/m=*/d=*/h=*/m=*/*"
spark.read.option("header","true").format("csv").load(pathList)
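Spark's path globbing (via Hadoop) also supports curly-brace alternation, so a contiguous day range can be expressed in a single wildcard path. A minimal sketch of reading days 23 and 24, assuming the same storage layout as above:
# {23,24} matches either day partition; everything else stays wildcarded
path = "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d={23,24}/h=*/m=*/*"
spark.read.option("header","true").format("csv").load(path)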
If you only want to read a specific set of files, it would be best to generate a list of the paths you want to read, which you can pass to the Spark read function:
pathList = [
"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/",
"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=01/"
]
spark.read.option("header","true").format("csv").load(pathList)
You could generate the pathList in this example programmatically, according to which files you want to process, e.g.:
pathList = []
for i in range(24):
    # zero-pad the hour to match the storage layout (h=00 ... h=23)
    newPath = f"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h={i:02d}/m=01/"
    pathList.append(newPath)
spark.read.option("header","true").format("csv").load(pathList)
This example would read every hour (00-23) of 2022-05-23 at minute 01.
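Since the question mentions that the folders actually contain multi-line JSON rather than CSV, the same path-list approach works with Spark's JSON reader. A minimal sketch reading both days (the loop bounds, the multiLine setting, and the df name are illustrative assumptions; Spark's default JSON reader expects one JSON object per line, so set multiLine to "true" only if each file holds a single pretty-printed document):
# build paths for every hour of day 23 and day 24
pathList = []
for d in (23, 24):
    for h in range(24):
        pathList.append(
            f"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d={d}/h={h:02d}/m=*/*"
        )

# JSON Lines (one object per line) is the default; flip multiLine to "true" otherwise
df = spark.read.option("multiLine", "false").json(pathList)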
Upvotes: 1