m1nkeh

Reputation: 1397

Databricks list all blobs in Azure Blob Storage

I have mounted a Blob Storage account into Databricks and can access it fine, so I know that it works.

What I want to do, though, is list the names of all the files at a given path. Currently I'm doing this with:

list = dbutils.fs.ls('dbfs:/mnt/myName/Path/To/Files/2019/03/01')
df = spark.createDataFrame(list).select('name')

The issue I have, though, is that it's exceptionally slow, because there are around 160,000 blobs at that location (Storage Explorer shows this as ~1,016,106,592 bytes, which is about 1 GB!).

This surely can't be pulling down all this data; all I need is the filename.

Is Blob Storage my bottleneck, or can I (somehow) get Databricks to execute the command in parallel or something?

Thanks.

Upvotes: 2

Views: 9667

Answers (1)

Peter Pan

Reputation: 24138

In my experience, and as I understand Azure Blob Storage, every operation on it, whether through an SDK or another tool, is ultimately translated into REST API calls. So your dbutils.fs.ls call is actually invoking the List Blobs REST API on the blob container.

Therefore, I'm fairly sure the bottleneck in your code is transferring the XML response bodies of the blob listing from Blob Storage and extracting the blob names into the list variable, which adds up when there are around 160,000 blobs. The listing returns names and metadata only, so the blob contents themselves are not being downloaded.

Moreover, the blob names come back in many pages (slices) of XML, with a MaxResults limit per page, and fetching the next page depends on the NextMarker value returned by the previous one. That is why listing the blobs is slow, and why it cannot be parallelized.
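
To make the sequential dependency concrete, here is a minimal sketch in plain Python that simulates marker-based paging. It does not call Azure: list_page is a hypothetical stand-in for one List Blobs request, the blob names are made up, and the page size of 5,000 is the documented cap for maxresults on that operation. The point is only that each request needs the marker from the previous response, so the requests have to run one after another.

```python
from typing import List, Optional, Tuple

MAX_RESULTS = 5000  # per-page cap of the List Blobs operation


def list_page(all_names: List[str], marker: Optional[str],
              max_results: int = MAX_RESULTS) -> Tuple[List[str], Optional[str]]:
    """Hypothetical stand-in for ONE List Blobs call.

    Returns one page of names plus the marker for the next page
    (None when there is nothing left to list).
    """
    start = 0 if marker is None else int(marker)
    end = min(start + max_results, len(all_names))
    next_marker = str(end) if end < len(all_names) else None
    return all_names[start:end], next_marker


def list_all(all_names: List[str]) -> Tuple[List[str], int]:
    """Walk every page in order, counting the round trips needed."""
    names: List[str] = []
    round_trips = 0
    marker: Optional[str] = None
    while True:
        # The marker for this request only exists once the previous
        # response has arrived, so the calls cannot overlap.
        page, marker = list_page(all_names, marker)
        names.extend(page)
        round_trips += 1
        if marker is None:
            break
    return names, round_trips


# Made-up blob names, sized like the question's ~160,000 blobs.
blobs = [f"2019/03/01/blob_{i:06d}" for i in range(160_000)]
names, round_trips = list_all(blobs)
print(len(names), round_trips)  # 160000 32
```

With 160,000 blobs and at most 5,000 names per page, that is at least 32 dependent round trips, each carrying a sizeable XML body.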

My suggestion for loading the blob list more efficiently is to cache the listing in advance, for example by generating a blob that contains the blob names, one per line. To keep it up to date in real time, you could use an Azure Function with a Blob Trigger that appends the blob name to an Append Blob whenever a blob is created.
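
A rough sketch of that index idea, again in plain Python and with a local file standing in for the Append Blob. The names record_blob_created, read_index and blob_index.txt are hypothetical, and a real Azure Function would append to the Append Blob through the storage SDK instead of a local file.

```python
import os
import tempfile
from typing import List


def record_blob_created(index_path: str, blob_name: str) -> None:
    """Append one blob name to the index, as a blob-created handler would."""
    with open(index_path, "a", encoding="utf-8") as f:
        f.write(blob_name + "\n")


def read_index(index_path: str) -> List[str]:
    """Read the cached listing back: one blob name per line."""
    with open(index_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Local file used as a stand-in for the Append Blob (hypothetical name).
index_path = os.path.join(tempfile.mkdtemp(), "blob_index.txt")

# Simulate three blob-creation events with made-up names.
for i in range(3):
    record_blob_created(index_path, f"2019/03/01/blob_{i}")

listed = read_index(index_path)
print(listed)
```

Reading the names then costs a single download of one text blob instead of a paged listing. In Databricks, such a line-per-name file could presumably be loaded with spark.read.text from the mounted path.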

Upvotes: 2
