Kiran A

Reputation: 259

List the files of a directory and its subdirectories recursively in Databricks (DBFS)

Using Python/dbutils, how can I display the files of the current directory and its subdirectories recursively in the Databricks file system (DBFS)?

Upvotes: 20

Views: 47375

Answers (5)

zerweck

Reputation: 736

A much faster solution is to use spark.read to recursively read all file names into a DataFrame, then extract the paths into a Python list via rdd.flatMap:

folder = "root_path/"

# recursiveFileLookup walks the whole tree; binaryFile produces
# one row per file, including a "path" column.
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .format("binaryFile")
    .load(folder)
)

# Collect only the paths as a flat Python list.
df.select('path').rdd.flatMap(lambda x: x).collect()

The reason this works:

  1. The native spark.read recursive lookup does the traversal inside Spark, which is much faster than a Python-level recursion
  2. Using binaryFile as the file format creates a single record in the DataFrame per file
  3. It also creates the "path" column
  4. Lazy loading combined with select('path') ensures that the big binary blobs are never actually loaded into RAM before the list is collected
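
If you only care about a particular file type, the same reader also accepts a pathGlobFilter option (Spark 3.0+); a minimal sketch, assuming you want only Parquet files:

# Same recursive lookup, restricted to Parquet files
# ("*.parquet" is an illustrative glob; adjust to your data)
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*.parquet")
    .format("binaryFile")
    .load(folder)
)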

Upvotes: 1

Marcin Skotis

Reputation: 41

You could also try this recursive function:

def lsR(path):
    # Flatten: each file contributes itself; each directory
    # contributes its own recursive listing.
    return [
        fname
        for flist in [
            ([fi.path] if fi.isFile() else lsR(fi.path))
            for fi in dbutils.fs.ls(path)
        ]
        for fname in flist
    ]


lsR("/your/folder")

Upvotes: 4

Doug

Reputation: 35206

There are other answers listed here, but it is worth noting that Databricks stores datasets as folders.

For example, you might have a 'directory' called my_dataset_here, which contains files like this:

my_dataset_here/part-00193-111-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-123-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-222-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-444-c845-4ce6-8714-123-c000.snappy.parquet
...

A typical set of tables will contain thousands of such files.

Attempting to enumerate every single file in such a folder can take a very long time (minutes), because the single call to dbutils.fs.ls must return every result in one array.

Therefore, a naive approach such as:

stack = ["/databricks-datasets/COVID/CORD-19/2020-03-13"]
while len(stack) > 0:
  current_folder = stack.pop(0)
  for file in dbutils.fs.ls(current_folder):
    if file.isDir():
      stack.append(file.path)
      print(file.path)
    else:
      print(file.path)

This will indeed list every file, but it will also take forever to finish. In my test environment, enumerating 50-odd tables took 8 minutes.

However, the 'delta' format, if used, creates a standard folder named '_delta_log' inside each delta table's folder.

We can therefore modify our code to check each folder to see if it is a dataset before attempting to enumerate the entire contents of the folder:

stack = ["/databricks-datasets/COVID/CORD-19/2020-03-13"]
while len(stack) > 0:
  current_folder = stack.pop(0)
  for file in dbutils.fs.ls(current_folder):
    if file.isDir():
      # Check if this is a delta table and do not recurse if so!
      try:
        delta_check_path = f"{file.path}/_delta_log"
        dbutils.fs.ls(delta_check_path)  # raises an exception if missing
        print(f"dataset: {file.path}")
      except Exception:
        stack.append(file.path)
        print(f"folder: {file.path}")
    else:
      print(f"file: {file.path}")

This code runs on the same test environment in 38 seconds.

In trivial situations the naive solution is acceptable, but it quickly becomes totally unacceptable in real-world situations.

Notice that this code will only work on delta tables; if you are using parquet/csv/whatever format, you're out of luck.
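
If your datasets are plain Parquet rather than Delta, a similar probe is sometimes possible: Spark jobs write a _SUCCESS marker into their output folder by default, and checking for that file directly avoids listing the dataset folder itself. A sketch, with the caveat that the marker can be disabled, so its presence is an assumption about how the data was written:

def is_spark_output_folder(path):
    # Assumes the writer left the default _SUCCESS marker in place
    try:
        dbutils.fs.ls(f"{path}/_SUCCESS")  # raises if the marker is missing
        return True
    except Exception:
        return False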

Upvotes: 1

choeh

Reputation: 129

An alternative implementation can be done with generators and the yield operator. You need at least Python 3.3 for yield from to work:

def get_dir_content(ls_path):
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            yield dir_path.path
        elif dir_path.isDir() and ls_path != dir_path.path:
            # The path check guards against ls returning the listed path itself
            yield from get_dir_content(dir_path.path)

list(get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13'))
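
One practical benefit of the generator version is lazy consumption: you can stop the traversal early. A small sketch using itertools.islice:

from itertools import islice

# Take just the first 10 paths; recursion stops once enough results
# have been yielded (each dbutils.fs.ls call still lists its folder in full)
first_ten = list(islice(get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13'), 10))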

Upvotes: 12

Daniel

Reputation: 1242

A surprising thing about dbutils.fs.ls (and the %fs magic command) is that it doesn't seem to support any recursive switch. However, since the ls function returns a list of FileInfo objects, it's quite trivial to iterate over them recursively to get the whole content, e.g.:

def get_dir_content(ls_path):
  dir_paths = dbutils.fs.ls(ls_path)
  subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
  flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
  return [p.path for p in dir_paths] + flat_subdir_paths


paths = get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13')
for p in paths:
  print(p)
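
On very deep trees it can help to cap the recursion; a sketch of the same approach with a hypothetical max_depth parameter:

def get_dir_content_capped(ls_path, max_depth=3):
  # Same traversal as above, but stop descending below max_depth levels
  dir_paths = dbutils.fs.ls(ls_path)
  subdir_paths = []
  if max_depth > 0:
    for p in dir_paths:
      if p.isDir() and p.path != ls_path:
        subdir_paths += get_dir_content_capped(p.path, max_depth - 1)
  return [p.path for p in dir_paths] + subdir_paths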

Upvotes: 21
