Martin Thoma

Reputation: 136197

Why does Dask not read the CSV?

I just tried

import dask.dataframe as dd
df = dd.read_csv("data.csv")
print(df.describe())

which gives

Dask DataFrame Structure:
              SOME_COL      FOO      BAR
npartitions=1  float64  float64  float64
                   ...      ...      ...
Dask Name: describe, 1234 tasks

There are two problems:

  1. I don't think anything was done as this is a 4GB CSV file and thus it should take at least a couple of seconds to read, but the print occurs immediately.
  2. I expected to get the min, 25%, median, 75% and max, but none of those descriptive values is shown.

What is the problem?

Upvotes: 0

Views: 872

Answers (2)

Klemen Koleša

Reputation: 446

Calling dd.read_csv() does not actually read the whole file; it only sets up the work to be done. To actually load the data and get a result, call the .compute() method, which runs the computation and returns a pandas object.

In other words, Dask is lazy. If the CSV file is only 4 GB and you have enough RAM, you may be able to read it directly with pandas, in chunks if necessary. You can also pass low_memory=False to pandas.read_csv.
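A minimal sketch of the chunked pandas approach, assuming you only need statistics that can be accumulated incrementally (count, min, max, mean). The tiny temporary CSV, the column name FOO, and the chunk size of 4 are placeholders for illustration, not values from the question. Quantiles such as the median cannot be combined this simply across chunks.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in CSV (hypothetical data, not the 4 GB file from the question).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
pd.DataFrame({"FOO": range(10)}).to_csv(csv_path, index=False)

count = 0
total = 0.0
col_min = float("inf")
col_max = float("-inf")

# With chunksize set, read_csv yields DataFrames of at most that many rows,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=4):
    col = chunk["FOO"]
    count += len(col)
    total += col.sum()
    col_min = min(col_min, col.min())
    col_max = max(col_max, col.max())

mean = total / count
print(count, col_min, col_max, mean)
```

For a real file you would pick a much larger chunksize, sized to what fits comfortably in memory.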

Upvotes: 0

MRocklin

Reputation: 57251

Dask.dataframe is lazy by default. You need to call .compute() when you want a real answer.

print(df.describe().compute())

Upvotes: 1
