Reputation: 2132
This is my code:
import dask.dataframe as df
data_frame = df.read_csv(normal_numerical_path, blocksize=None)
data_frame = data_frame.dropna(how='all')
bad_samples = data_frame[data_frame['Response'] == 1].shape[0].compute()
good_samples = data_frame[data_frame['Response'] == 0].shape[0].compute()
I can see in my Dask Client that the .csv file is read two times.

Why does Dask read my .csv file twice, when I only call the .read_csv method once?

I have only 16GB of RAM and my dataset is too large to fit in there at once - does this mean that every time I call the .compute() method on the data_frame, the .csv file needs to be read? Is it possible to call the .shape[0] method on the dataset without reading the .csv file twice?

My client is set up as follows:

from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='6GB')
client
Upvotes: 1
Views: 55
Reputation: 4214
I have only 16GB of RAM and my dataset is too large to fit in there at once - does this mean that every time I call the .compute() method on the data_frame, the .csv file needs to be read? Thus resulting in two long reads?
Yes, with caveats.
Each time you call compute on an object, Dask fully executes the task graph for that object. Because your data is not cached in memory by default, it will be read from disk on each call. You can explicitly compute both results at once with client.compute, which shares the common tasks (including reading the .csv file) and avoids the redundant work.
import dask.dataframe as dd
from dask.datasets import timeseries
from distributed import Client
client = Client()
# generate fake data for illustration
timeseries().to_csv("example")
ddf = dd.read_csv("example/*.part")
sarah = ddf.loc[ddf.name == "Sarah"]
kevin = ddf.loc[ddf.name == "Kevin"]
# compute both dataframes at once, so dask doesn't waste computation;
# note that client.compute returns futures, whose values are retrieved
# with .result() once they finish
sarah, kevin = client.compute([sarah, kevin])
print(sarah.result().head())
timestamp id name x y
48 2000-01-01 00:00:48 1021 Sarah 0.227437 -0.105553
53 2000-01-01 00:00:53 933 Sarah 0.400616 -0.549600
66 2000-01-01 00:01:06 1021 Sarah -0.866886 0.762793
121 2000-01-01 00:02:01 994 Sarah 0.316168 0.305508
126 2000-01-01 00:02:06 1024 Sarah 0.978563 0.647163
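Applied to your own snippet, the same pattern might look roughly like this (a sketch; normal_numerical_path and the Client setup are taken from your question):

import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=2, memory_limit='6GB')

data_frame = dd.read_csv(normal_numerical_path, blocksize=None)
data_frame = data_frame.dropna(how='all')

# build both lazy results first, without computing anything yet
bad_lazy = data_frame[data_frame['Response'] == 1].shape[0]
good_lazy = data_frame[data_frame['Response'] == 0].shape[0]

# submit them together: the shared read_csv tasks appear only once in the
# merged graph, so the file is read a single time
bad_future, good_future = client.compute([bad_lazy, good_lazy])
bad_samples = bad_future.result()
good_samples = good_future.result()

If your data did fit in memory, you could also call data_frame = data_frame.persist() once and then compute against the cached partitions as often as you like, but with a larger-than-memory dataset the client.compute approach above is the safer option.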
Upvotes: 3