Reputation: 109
I've recently started working with Databricks Python notebooks and can't figure out how to read multiple .csv files from DBFS the way I did in Jupyter notebooks.
I've tried:
import glob
import pandas as pd

path = r'dbfs:/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)
This code worked perfectly in Jupyter notebooks, but in Databricks, I receive this error:
ValueError: No objects to concatenate
I can read a single file from this path with df = pd.read_csv('dbfs_path/filename.csv')
Thanks!
Upvotes: 1
Views: 6189
Reputation: 2334
When you are reading from a DBFS location, you should list the files through the dbutils command, like this:
files = dbutils.fs.ls('/FileStore/shared_uploads/path/')
li = []
for fi in files:
    print(fi.path)
    # <your logic here>
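For example, a minimal sketch of what that loop could look like for the question's use case, assuming the goal is to concatenate the files with pandas and that the cluster exposes the standard /dbfs fuse mount (pandas cannot open dbfs:/ URIs directly):

import pandas as pd

files = dbutils.fs.ls('/FileStore/shared_uploads/path/')

li = []
for fi in files:
    if fi.path.endswith('.csv'):
        # fi.path looks like 'dbfs:/FileStore/...'; rewrite it to the
        # '/dbfs/...' fuse-mount form that pandas can open
        local_path = fi.path.replace('dbfs:/', '/dbfs/')
        li.append(pd.read_csv(local_path, index_col=None, header=0, low_memory=False))

data = pd.concat(li, axis=0, ignore_index=True)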
Upvotes: 2
Reputation: 964
You need to change path to r'/dbfs/FileStore/shared_uploads/path/'. The glob function works against the raw filesystem attached to the driver and has no notion of what dbfs: means.
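With that one change, the rest of the question's loop works unchanged; only the path line differs:

path = r'/dbfs/FileStore/shared_uploads/path/'  # note: /dbfs/ fuse-mount prefix, not dbfs:/
all_files = glob.glob(path + "/*.csv")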
Also, since you are combining a lot of CSV files, why not read them in directly with Spark:
path = r'dbfs:/FileStore/shared_uploads/path/*.csv'
df = spark.read.csv(path)
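Since the question's pandas call uses header=0, the files presumably have a header row; assuming that, and assuming you ultimately want a pandas DataFrame, you can pass the usual reader options and convert at the end:

df = (spark.read
      .option("header", True)       # treat the first row of each file as column names
      .option("inferSchema", True)  # sample the data to infer column types
      .csv(path))

data = df.toPandas()  # collect the result to the driver as a pandas DataFrame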
Upvotes: 2