Reputation: 136725

Why does Dask not read the CSV?

I just tried

import dask.dataframe as dd
df = dd.read_csv("data.csv")
print(df.describe())

which gives

Dask DataFrame Structure:
              SOME_COL      FOO      BAR
npartitions=1  float64  float64  float64
                   ...      ...      ...
Dask Name: describe, 1234 tasks

There are two problems:

I don't think anything was done as this is a 4GB CSV file and thus it should take at least a couple of seconds to read, but the print occurs immediately.
I expected to get the min, 25%, median, 75% and max, but none of those descriptive values is shown.

What is the problem?

Upvotes: 0

Answers (2)

Klemen Koleša

Reputation: 446

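Calling dd.read_csv() does very little on its own: it samples the start of the file to infer column names and dtypes, then builds a task graph. To actually read the CSV and get values back, call the .compute() method on the result, which returns an ordinary pandas object.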

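In other words, Dask is lazy. If the CSV file is only 4 GB and you have enough RAM, you may be able to read it directly with pandas, in chunks if necessary. You can also set low_memory=False in pandas.read_csv, which avoids mixed-type inference at the cost of higher memory use while parsing.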
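A minimal sketch of the chunked pandas approach, assuming a small stand-in file: the temporary CSV, the FOO column values, and the chunk size of 4 are made up for illustration and are not the asker's data. It accumulates count, min, max and mean one chunk at a time, so only one chunk is held in memory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real 4 GB file: ten rows, one column.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
pd.DataFrame({"FOO": [float(i) for i in range(10)]}).to_csv(csv_path, index=False)

row_count = 0
foo_sum = 0.0
foo_min = float("inf")
foo_max = float("-inf")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(csv_path, chunksize=4):
    row_count += len(chunk)
    foo_sum += float(chunk["FOO"].sum())
    foo_min = min(foo_min, float(chunk["FOO"].min()))
    foo_max = max(foo_max, float(chunk["FOO"].max()))

foo_mean = foo_sum / row_count
print(row_count, foo_min, foo_max, foo_mean)
```

Quantiles such as the 25%, median and 75% values cannot be combined across chunks this simply, which is one reason to let Dask's describe() do the work instead.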

Upvotes: 0

MRocklin

Reputation: 57311

Dask.dataframe is lazy by default. You need to call .compute() when you want a real answer.

print(df.describe().compute())

Upvotes: 1
