baatchen

Reputation: 489

Databricks, dbutils, get filecount and filesize of all subfolders in Azure Data Lake gen 2 path

I'm working in a Databricks notebook (PySpark) and trying to get the file count and file size of all subfolders under a specific Azure Data Lake Gen2 mount path using dbutils.

I have code that does this for a single folder, but I'm stuck on how to write the recursive part...
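
For illustration, the single-folder version is along these lines (the mount path here is just a placeholder; FileInfo objects returned by dbutils.fs.ls expose size and isDir()):

files = dbutils.fs.ls('/mnt/my_mount/some/folder')  # placeholder mount path
file_count = len([f for f in files if not f.isDir()])
total_size = sum(f.size for f in files if not f.isDir())
print(file_count, total_size)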

Upvotes: 1

Views: 13439

Answers (2)

Manikandan Velayudham

Reputation: 11

Get the list of files from the directory recursively, then print them and get the count with the code below.

def get_dir_content(ls_path):
  # List the immediate contents of ls_path (both files and directories)
  dir_paths = dbutils.fs.ls(ls_path)
  # Recurse into each subdirectory (the ls_path check avoids re-listing the same path)
  subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
  # Flatten the nested lists returned by the recursive calls
  flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
  # Return the paths at this level plus everything found below it
  return list(map(lambda p: p.path, dir_paths)) + flat_subdir_paths

paths = get_dir_content('dbfs:/')

or

paths = get_dir_content('abfss://<container>@<storage-account>.dfs.core.windows.net/')

The line below prints each file's path and, at the end, the number of files.

len([print(p) for p in paths])

If you only want the number of files, use the line below:

len(paths)
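
Note that get_dir_content returns only paths. If you also need the total file size, as the question asks, a variation that keeps the FileInfo objects (so their size attribute stays available) might look like the sketch below; the mount path is a placeholder:

def get_dir_files(ls_path):
  # Same recursion as above, but return FileInfo objects instead of just paths
  entries = dbutils.fs.ls(ls_path)
  files = [e for e in entries if not e.isDir()]
  for e in entries:
    if e.isDir() and e.path != ls_path:
      files += get_dir_files(e.path)
  return files

files = get_dir_files('/mnt/my_mount/')  # placeholder mount path
print(len(files), sum(f.size for f in files))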

Upvotes: 1

pltc

Reputation: 6082

How about this?

def deep_ls(path: str):
    """List all files in base path recursively."""
    for x in dbutils.fs.ls(path):
        # Directory paths returned by dbutils.fs.ls end with '/'
        if x.path[-1] != '/':
            yield x
        else:
            # Recurse into the subdirectory
            for y in deep_ls(x.path):
                yield y
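
To get the file count and total size the question asks for, one way to use this generator is the following (the mount path is a placeholder):

files = list(deep_ls('/mnt/my_mount/'))  # placeholder mount path
print(f"{len(files)} files, {sum(f.size for f in files)} bytes in total")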

Credits to

https://forums.databricks.com/questions/18932/listing-all-files-under-an-azure-data-lake-gen2-co.html

https://gist.github.com/Menziess/bfcbea6a309e0990e8c296ce23125059

Upvotes: 2
