Jakub Szlaur

Reputation: 2132

Why does calling the df.read_csv method once result in two .csv reads?

This is my code:

import dask.dataframe as df

data_frame = df.read_csv(normal_numerical_path, blocksize=None)
data_frame = data_frame.dropna(how='all')

bad_samples = data_frame[data_frame['Response'] == 1].shape[0].compute()
good_samples = data_frame[data_frame['Response'] == 0].shape[0].compute()

I can see in my Dask Client that the .csv file is read twice.

Why does Dask read my .csv file twice when I call the .read_csv method only once?


Other info:

This is my setup code:

from dask.distributed import Client, progress

client = Client(n_workers=2, threads_per_worker=2, memory_limit='6GB')
client

Upvotes: 1

Views: 55

Answers (1)

Nick Becker

Reputation: 4214

I have only 16GB of RAM and my dataset is too large to fit in there at once - does this mean that every time I call the .compute() method on the data_frame the .csv file needs to be read?

Thus resulting in two long reads?

Yes, with caveats.

Each time you call compute on an object, Dask executes the full task graph for that object. Because your data is not cached in memory by default, Dask reads the file again on every call. You can explicitly compute both results in a single call with client.compute, which lets Dask share the read step and avoid the redundant work.

import dask.dataframe as dd
from dask.datasets import timeseries
from distributed import Client

client = Client()

# generate fake data for illustration
timeseries().to_csv("example")

ddf = dd.read_csv("example/*.part")

sarah = ddf.loc[ddf.name == "Sarah"]
kevin = ddf.loc[ddf.name == "Kevin"]

# compute both dataframes at once, so dask doesn't waste computation
# note that client.compute returns futures; once they finish, call
# .result() on each one to get the computed dataframe
sarah, kevin = client.compute([sarah, kevin])
print(sarah.result().head())
               timestamp    id   name         x         y
48   2000-01-01 00:00:48  1021  Sarah  0.227437 -0.105553
53   2000-01-01 00:00:53   933  Sarah  0.400616 -0.549600
66   2000-01-01 00:01:06  1021  Sarah -0.866886  0.762793
121  2000-01-01 00:02:01   994  Sarah  0.316168  0.305508
126  2000-01-01 00:02:06  1024  Sarah  0.978563  0.647163

Upvotes: 3
