Santosh Kumar

Reputation: 791

How to create dask dataframe from CSV file stored in HDFS (many part files)

I am trying to create a dask dataframe from a CSV file stored in HDFS. The CSV stored in HDFS is split into many part files.

On calling the read_csv API:

import dask.dataframe as dd
df = dd.read_csv("hdfs:<some path>/data.csv")

the following error occurs:

OSError: Could not open file: <some path>/data.csv, mode: rb
Path is not a file: <some path>/data.csv

In fact, /data.csv is a directory containing many part files. I'm not sure if there is a different API to read such an HDFS CSV.

Upvotes: 1

Views: 894

Answers (1)

mdurant

Reputation: 28673

Dask does not know which files you intend to read from when you pass only a directory name. You should pass a glob string that it uses to search for files, or an explicit list of files, e.g.,

df = dd.read_csv("hdfs:///some/path/data.csv/*.csv")

Note the leading '/' after the colon: all HDFS paths begin this way.
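
For the explicit-list alternative mentioned above, here is a minimal sketch; the part-file names below are hypothetical placeholders for whatever files actually sit in the directory:

import dask.dataframe as dd

# dd.read_csv also accepts an explicit list of paths instead of a glob.
# These file names are placeholders, not the questioner's actual files.
parts = [
    "hdfs:///some/path/data.csv/part-00000",
    "hdfs:///some/path/data.csv/part-00001",
]
df = dd.read_csv(parts)

Each file in the list becomes at least one partition of the resulting dataframe, so the glob and list forms produce equivalent results when they match the same files.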

Upvotes: 2
