Vladislava Gonchar

Reputation: 109

Read multiple files from Databricks DBFS

I've started to work with Databricks Python notebooks recently and can't figure out how to read multiple .csv files from DBFS the way I did in Jupyter notebooks earlier.

I've tried:

import glob
import pandas as pd

path = r'dbfs:/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)

This code worked perfectly in Jupyter notebooks, but in Databricks, I receive this error: ValueError: No objects to concatenate

I can read a single file in this path using df = pd.read_csv('dbfs_path/filename.csv')

Thanks!

Upvotes: 1

Views: 6189

Answers (2)

When reading from a DBFS location, you should list the files with the dbutils command, like this:

files = dbutils.fs.ls('/FileStore/shared_uploads/path/')
li = []
for fi in files:
    print(fi.path)
    # <your logic here>
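
For example, the loop body could read each listed file with pandas. A minimal sketch, assuming the pandas approach from the question and that the dbfs:/ prefix returned by dbutils has to be rewritten to the local /dbfs/ FUSE mount before pandas can open the file (as the other answer points out):

import pandas as pd

# List the files in the DBFS directory (path taken from the question).
files = dbutils.fs.ls('/FileStore/shared_uploads/path/')

li = []
for fi in files:
    if not fi.path.endswith('.csv'):
        continue
    # dbutils returns paths like 'dbfs:/...'; pandas needs the local
    # FUSE mount, so rewrite the scheme to '/dbfs/'.
    local_path = fi.path.replace('dbfs:/', '/dbfs/')
    li.append(pd.read_csv(local_path, index_col=None, header=0, low_memory=False))

data = pd.concat(li, axis=0, ignore_index=True)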

Upvotes: 2

fskj

Reputation: 964

You need to change the path to r'/dbfs/FileStore/shared_uploads/path/'

The glob function works against the local filesystem attached to the driver and has no notion of what dbfs: means; DBFS is only visible there through the /dbfs mount.
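
A minimal sketch of the original loop with only the path changed, assuming the standard /dbfs FUSE mount is available on the driver:

import glob
import pandas as pd

# Use the local /dbfs mount so glob and pandas can see the files.
path = r'/dbfs/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = [pd.read_csv(f, index_col=None, header=0, low_memory=False) for f in all_files]
data = pd.concat(li, axis=0, ignore_index=True)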

Also, since you are combining a lot of CSV files, why not read them directly with Spark:

path = r'dbfs:/FileStore/shared_uploads/path/*.csv' 
df = spark.read.csv(path)
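
If you still need a pandas DataFrame afterwards, one possible variant (assuming the files have a header row, as header=0 in the question suggests) is:

# Spark understands the dbfs:/ scheme directly; header=True uses the
# first row of each file as the column names.
sdf = spark.read.csv(r'dbfs:/FileStore/shared_uploads/path/*.csv', header=True)

# Convert to pandas only if the combined data fits in driver memory.
data = sdf.toPandas()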

Upvotes: 2
