quervernetzt

Reputation: 11621

Databricks / pyspark: How to get all full directory paths (that have at least one file as content) from Azure Blob storage recursively

Assume we have a virtual folder structure in an Azure Blob Storage container (that is mounted) as follows:

someroot
    2019
        01
            05
                somefile0
    2020
        11
            01
                somefile1
        12
            02
                somefile2
            03
                somefile3

As you can see, the final level contains no further subfolders (i.e. there is no mixture of folders and files, only files).

How can I get all full directory paths (excluding file paths) as a flat list, recursively?
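
For the structure above, the expected result would be a flat list of the four leaf directories:

    someroot/2019/01/05
    someroot/2020/11/01
    someroot/2020/12/02
    someroot/2020/12/03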

Upvotes: 2

Views: 2422

Answers (1)

quervernetzt

Reputation: 11621

Here is a solution that returns a flat list of full directory paths (excluding file paths):

def get_all_directory_paths(base_path: str) -> list:
  """Get all full directory paths

  Parameters
  ----------
  base_path : str
      The starting path to search from

  Returns
  -------
  list
      Flat list of directory paths
  """

  all_paths: list = []

  def get_paths(base_path: str):
    # dbutils is provided by the Databricks runtime; dbutils.fs.ls returns
    # a list of FileInfo objects for the given path
    dir_paths: list = dbutils.fs.ls(base_path)
    # Collect the subdirectories, if any
    subdir_paths: list = [p.path for p in dir_paths if p.isDir()]
    if len(subdir_paths) == 0:
      # No subdirectories left: this is a leaf directory, so record it
      all_paths.append(base_path)
    else:
      # Otherwise recurse into each subdirectory
      for subdir_path in subdir_paths:
        get_paths(subdir_path)

  get_paths(base_path)

  return all_paths
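
For example, assuming the container from the question is mounted at /mnt/someroot (a hypothetical mount point), a call might look like this. Note that the recursive calls collect the paths returned by dbutils.fs.ls, which typically carry a dbfs: prefix and a trailing slash; the exact prefix depends on how the storage is mounted:

    leaf_dirs = get_all_directory_paths("/mnt/someroot")
    # e.g. ['dbfs:/mnt/someroot/2019/01/05/',
    #       'dbfs:/mnt/someroot/2020/11/01/',
    #       'dbfs:/mnt/someroot/2020/12/02/',
    #       'dbfs:/mnt/someroot/2020/12/03/']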

Upvotes: 2