Reputation: 1397
I have mounted a Blob Storage account into Databricks and can access it fine, so I know that it works.
What I want to do, though, is list the names of all of the files at a given path. Currently I'm doing this with:
list = dbutils.fs.ls('dbfs:/mnt/myName/Path/To/Files/2019/03/01')
df = spark.createDataFrame(list).select('name')
The issue I have, though, is that it's exceptionally slow, because there are around 160,000 blobs at that location (Storage Explorer shows this as ~1016106592 bytes, which is about 1 GB!).
Surely this can't be pulling down all of that data when all I need/want is the filename.
Is Blob Storage my bottleneck, or can I (somehow) get Databricks to execute the command in parallel?
Thanks.
Upvotes: 2
Views: 9667
Reputation: 24138
In my experience, and based on my understanding of Azure Blob Storage, every operation an SDK (or anything else) performs against Blob Storage is translated into REST API calls. Your dbutils.fs.ls call is therefore really invoking the List Blobs REST API against the blob container.
So I'm sure the performance bottleneck of your code is the transfer of the sizable XML response body of the blob listing from Blob Storage, just so the blob names can be extracted into your list variable, given that there are around 160,000 blobs.
Meanwhile, the blob names come back wrapped in many pages of XML responses: there is a MaxResults limit per page, and fetching the next page depends on the NextMarker value returned with the previous one. That is why listing blobs is slow and why it cannot be parallelized (see the sketch below).
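For reference, here is a minimal sketch of what that listing looks like when driven directly from the azure-storage-blob SDK (v12), assuming a placeholder connection string and that the myName mount maps to a container of the same name. The SDK follows NextMarker for you, but the pages are still fetched one after another:

from azure.storage.blob import ContainerClient

# Placeholder connection details -- substitute your own storage account values.
container = ContainerClient.from_connection_string(
    "<your-connection-string>", container_name="myName")

# list_blobs() pages through the XML listing, following NextMarker under the
# hood; only blob metadata (including the name) is returned, but each page
# must still be fetched sequentially.
names = [blob.name for blob in
         container.list_blobs(name_starts_with="Path/To/Files/2019/03/01/")]
print(len(names))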
My suggestion for making the blob listing faster to load is to cache it in advance, for example by writing the blob names, one per line, into a separate blob. To keep that cache up to date in real time, you can use an Azure Function with a Blob Trigger that appends each new blob name to an Append Blob whenever a blob-creation event fires, roughly as sketched below.
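Here is a rough sketch of that trigger, assuming the Python programming model for Azure Functions (the blobTrigger binding named myblob in function.json is omitted); the metadata container and blob-index.txt names are placeholders:

import os
import logging
import azure.functions as func
from azure.storage.blob import AppendBlobClient

def main(myblob: func.InputStream):
    # For a Blob Trigger, myblob.name is "<container>/<blob path>".
    blob_path = myblob.name.split("/", 1)[1]

    # Placeholder index blob; AzureWebJobsStorage is the Function App's
    # default storage connection string setting.
    index = AppendBlobClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],
        container_name="metadata",
        blob_name="blob-index.txt")

    # Create the append blob once, then append one line per new blob.
    if not index.exists():
        index.create_append_blob()
    index.append_block((blob_path + "\n").encode("utf-8"))

    logging.info("Indexed new blob: %s", blob_path)

Your Databricks job can then read that single index blob (one name per line) instead of paging through the List Blobs API.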
Upvotes: 2