Guillermo Colom

Reputation: 9

List all file names located in an Azure Blob Storage

I want to list in Databricks all file names located in an Azure Blob Storage.
My Azure Blob Storage is structured like this:

aaa
------bbb
------------bbb1.xml
------------bbb2.xml
------ccc
------------ccc1.xml
------------ccc2.xml
------------ccc3.xml

If I do:

dbutils.fs.ls('wasbs://<container>@<storage-account>.blob.core.windows.net/aaa')

only subfolders bbb and ccc are listed like this:

[FileInfo(path='wasbs://<container>@<storage-account>.blob.core.windows.net/aaa/bbb/', name='bbb/', size=0),
 FileInfo(path='wasbs://<container>@<storage-account>.blob.core.windows.net/aaa/ccc/', name='ccc/', size=0)]

I want to recurse down to the deepest subfolders and see all file names located under aaa: bbb1.xml, bbb2.xml, ccc1.xml, ccc2.xml and ccc3.xml.

If I do:

dbutils.fs.ls('wasbs://<container>@<storage-account>.blob.core.windows.net/aaa/*')

an error occurs because wildcards are not supported in the path.

Any idea how to do this in Databricks?

Upvotes: 1

Views: 2276

Answers (1)

Alex Ott

Reputation: 87369

dbutils.fs.ls doesn't support wildcards; that's why you're getting an error. You have a few choices:

  1. Use the Python SDK for Azure Blob Storage to list the files - it could be faster than recursive dbutils.fs.ls calls, but you will need to set up authentication, etc. (see the first sketch after this list).

  2. You can make recursive calls to dbutils.fs.ls using a function like the one below, but it's not very performant:

def list_files(path, max_level=1, cur_level=0):
  """
  Lists files under the given path, recursing up to max_level
  """
  d = dbutils.fs.ls(path)
  for i in d:
    # Directories are reported with a trailing "/" and a size of 0
    if i.name.endswith("/") and i.size == 0 and cur_level < (max_level - 1):
      # Descend one level deeper into the subfolder
      yield from list_files(i.path, max_level, cur_level + 1)
    else:
      yield i.path

# Example: with the default max_level=1 the function doesn't recurse at all,
# so for the two-level structure above you would call:
# list(list_files('wasbs://<container>@<storage-account>.blob.core.windows.net/aaa', max_level=2))
  3. You can use the Hadoop API to access files in your container, similar to this answer (see the second sketch below).
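A minimal sketch of the first option, using the azure-storage-blob package (pip install azure-storage-blob); the connection string and container name are placeholders, and authenticating with a connection string is just one possibility:

from azure.storage.blob import ContainerClient

# Assumption: authentication via a connection string; a SAS token or
# an Azure AD credential would work as well
container = ContainerClient.from_connection_string(
  conn_str="<your-connection-string>",
  container_name="<container>",
)

# list_blobs returns a flat listing of all blobs, so the nested "folders"
# are traversed automatically; name_starts_with restricts it to aaa/
for blob in container.list_blobs(name_starts_with="aaa/"):
  print(blob.name)  # e.g. aaa/bbb/bbb1.xml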
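And a sketch of the third option, reaching the Hadoop FileSystem API through Spark's JVM gateway (the wasbs URL below is a placeholder):

# Build a Hadoop Path for the folder and get the matching FileSystem
hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(
  "wasbs://<container>@<storage-account>.blob.core.windows.net/aaa")
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())

# listFiles with recursive=True returns a RemoteIterator over all files
# below the path, skipping the intermediate directories
files = fs.listFiles(hadoop_path, True)
while files.hasNext():
  print(files.next().getPath().toString())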

Upvotes: 1
