I want to list in Databricks all the file names located in an Azure Blob Storage container.
My Azure Blob Storage is structured like this:
aaa
------bbb
------------bbb1.xml
------------bbb2.xml
------ccc
------------ccc1.xml
------------ccc2.xml
------------ccc3.xml
If I do:
dbutils.fs.ls('wasbs://container@account.blob.core.windows.net/aaa')
only the subfolders bbb and ccc are listed, like this:
[FileInfo(path='wasbs://container@account.blob.core.windows.net/aaa/bbb/', name='bbb/', size=0),
 FileInfo(path='wasbs://container@account.blob.core.windows.net/aaa/ccc/', name='ccc/', size=0)]
I want to go down to the deepest subfolders to see all the file names located under aaa: bbb1.xml, bbb2.xml, ccc1.xml, ccc2.xml and ccc3.xml.
If I do:
dbutils.fs.ls('wasbs://container@account.blob.core.windows.net/aaa/*')
an error occurs because the path cannot be parameterized with a wildcard.
Any idea how to do this in Databricks?
dbutils.fs.ls doesn't support wildcards, that's why you're getting an error. You have a few choices:
- Use the Python SDK for Azure Blob Storage to list files - it could be faster than recursive dbutils.fs.ls calls, but you will need to set up authentication, etc. (a minimal sketch using the SDK follows at the end of this answer).
- You can make recursive calls to dbutils.fs.ls using a function like the one below, but it's not very performant:
def list_files(path, max_level=1, cur_level=0):
    """
    Lists files under the given path, recursing up to max_level levels deep.
    """
    d = dbutils.fs.ls(path)
    for i in d:
        # Directories are reported with a trailing "/" and a size of 0
        if i.name.endswith("/") and i.size == 0 and cur_level < (max_level - 1):
            # Descend into the subdirectory until the maximum depth is reached
            yield from list_files(i.path, max_level, cur_level + 1)
        else:
            yield i.path
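For the layout in the question the files sit two levels below aaa, so a call could look like this (the path is a placeholder, assuming your own container and storage account names):

files = list(list_files('wasbs://container@account.blob.core.windows.net/aaa', max_level=2))
# e.g. ['.../aaa/bbb/bbb1.xml', '.../aaa/bbb/bbb2.xml', ..., '.../aaa/ccc/ccc3.xml']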
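For the first option, here is a minimal sketch using the azure-storage-blob package; the account URL, container name, and credential are placeholders that you would replace with your own values. Because blob storage exposes a flat namespace, list_blobs with a name prefix returns every blob under aaa at any depth:

from azure.storage.blob import ContainerClient

# Placeholder values - replace with your own storage account, container, and credential
container = ContainerClient(
    account_url="https://<account-name>.blob.core.windows.net",
    container_name="<container-name>",
    credential="<account-key-or-sas-token>",
)

# list_blobs() walks the flat blob namespace, so it returns blobs at every depth
for blob in container.list_blobs(name_starts_with="aaa/"):
    print(blob.name)  # e.g. aaa/bbb/bbb1.xml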