Reputation: 791
I am trying to create a dask dataframe from a CSV file stored in HDFS. The CSV stored in HDFS consists of many part files.
On calling the read_csv API:
dd.read_csv("hdfs:<some path>/data.csv")
the following error occurs:
OSError: Could not open file: <some path>/data.csv, mode: rb Path is not a file: <some path>/data.csv
In fact, /data.csv is a directory containing many part files. I'm not sure whether there is a different API for reading such an HDFS CSV.
Upvotes: 1
Views: 894
Reputation: 28673
Dask does not know which files you intend to read when you pass only a directory name. You should pass a glob string used to search for files, or an explicit list of files, e.g.,
df = dd.read_csv("hdfs:///some/path/data.csv/*.csv")
Note the leading '/' after the colon: all HDFS paths begin this way.
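To illustrate why the glob is needed, here is a stdlib-only sketch (no dask or HDFS required; all paths and file names are hypothetical) that builds a local `data.csv` directory of part files, shows that opening the directory itself fails much like in the question, and then reads the parts via a glob pattern, analogous to what `dd.read_csv("hdfs:///some/path/data.csv/*.csv")` asks dask to do:

```python
import csv
import glob
import os
import tempfile

# A "data.csv" that is actually a directory of part files,
# as HDFS writers commonly produce (hypothetical local stand-in).
tmp = tempfile.mkdtemp()
part_dir = os.path.join(tmp, "data.csv")
os.makedirs(part_dir)
for i, rows in enumerate([[("a", 1), ("a", 2)], [("b", 3)]]):
    with open(os.path.join(part_dir, f"part-{i}.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "value"])
        writer.writerows(rows)

# Opening the directory as if it were a file fails, mirroring the
# "Path is not a file" error from the question.
try:
    open(part_dir, "rb")
except OSError as exc:
    print("directory open failed:", type(exc).__name__)

# A glob enumerates the actual part files, which is roughly what
# passing a glob string to dd.read_csv does for you.
records = []
for path in sorted(glob.glob(os.path.join(part_dir, "*.csv"))):
    with open(path, newline="") as f:
        records.extend(dict(row) for row in csv.DictReader(f))
print(len(records))  # 3 rows combined across both part files
```

The same idea carries over directly: replace the local glob with the `hdfs:///.../data.csv/*.csv` glob string and dask will expand it to the matching part files and read them as one dataframe.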
Upvotes: 2