Vladislava Gonchar

Reputation: 109

Read multiple files from Databricks DBFS

I've started to work with Databricks Python notebooks recently and can't figure out how to read multiple .csv files from DBFS the way I did in Jupyter notebooks earlier.

I've tried:

import glob
import pandas as pd

path = r'dbfs:/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)

This code worked perfectly in Jupyter notebooks, but in Databricks, I receive this error: ValueError: No objects to concatenate

I can read a single file in this path using df = pd.read_csv('dbfs_path/filename.csv')

Thanks!

Upvotes: 1

Views: 6189

Answers (2)

When reading from a DBFS location, you should list the files with the dbutils command, like this:

files = dbutils.fs.ls('/FileStore/shared_uploads/path/')
li = []
for fi in files:
    print(fi.path)
    # <your logic here>
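
For example, the loop body could read each listed file with pandas. A minimal sketch, assuming the pandas approach from the question and that the dbfs:/ prefix returned by dbutils has to be rewritten to the local /dbfs/ FUSE mount before pandas can open the file (as the other answer points out):

import pandas as pd

# List the files in the DBFS directory (path taken from the question).
files = dbutils.fs.ls('/FileStore/shared_uploads/path/')

li = []
for fi in files:
    if not fi.path.endswith('.csv'):
        continue
    # dbutils returns paths like 'dbfs:/...'; pandas needs the local
    # FUSE mount, so rewrite the scheme to '/dbfs/'.
    local_path = fi.path.replace('dbfs:/', '/dbfs/')
    li.append(pd.read_csv(local_path, index_col=None, header=0, low_memory=False))

data = pd.concat(li, axis=0, ignore_index=True)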

Upvotes: 2

fskj

Reputation: 964

You need to change the path to r'/dbfs/FileStore/shared_uploads/path/'

The glob function works against the local filesystem attached to the driver and has no notion of what dbfs: means; DBFS is only visible there through the /dbfs mount.
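
A minimal sketch of the original loop with only the path changed, assuming the standard /dbfs FUSE mount is available on the driver:

import glob
import pandas as pd

# Use the local /dbfs mount so glob and pandas can see the files.
path = r'/dbfs/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = [pd.read_csv(f, index_col=None, header=0, low_memory=False) for f in all_files]
data = pd.concat(li, axis=0, ignore_index=True)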

Also, since you are combining a lot of CSV files, why not read them directly with Spark:

path = r'dbfs:/FileStore/shared_uploads/path/*.csv' 
df = spark.read.csv(path)
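
If you still need a pandas DataFrame afterwards, one possible variant (assuming the files have a header row, as header=0 in the question suggests) is:

# Spark understands the dbfs:/ scheme directly; header=True uses the
# first row of each file as the column names.
sdf = spark.read.csv(r'dbfs:/FileStore/shared_uploads/path/*.csv', header=True)

# Convert to pandas only if the combined data fits in driver memory.
data = sdf.toPandas()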

Upvotes: 2
