Reputation: 2132
This is my code:
import dask.dataframe as df
data_frame = df.read_csv(normal_numerical_path, blocksize=None)
data_frame = data_frame.dropna(how='all')
bad_samples = data_frame[data_frame['Response'] == 1].shape[0].compute()
good_samples = data_frame[data_frame['Response'] == 0].shape[0].compute()
I can see in my Dask Client that the .csv file is read two times.

Why does Dask read my .csv file twice, when I only call the .read_csv method once?

I have only 16GB of RAM and my dataset is too large to fit in there at once - does this mean that every time I call the .compute() method on the data_frame, the .csv file needs to be read? Is it possible to call the .shape[0] method on the dataset without reading the .csv file twice?

My client is set up as follows:

from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='6GB')
client
Upvotes: 1
Views: 55
Reputation: 4214
I have only 16GB of RAM and my dataset is too large to fit in there at once - does this mean that every time I call the .compute() method on the data_frame, the .csv file needs to be read? Thus resulting in two long reads?
Yes, with caveats.
Each time you call compute on an object, Dask fully executes the task graph for that object. Because your data is not cached in memory by default, it will be read from disk on each call. You can explicitly compute both results at once with client.compute, which shares the common tasks (including reading the .csv file) and avoids the redundant work.
import dask.dataframe as dd
from dask.datasets import timeseries
from distributed import Client
client = Client()
# generate fake data for illustration
timeseries().to_csv("example")
ddf = dd.read_csv("example/*.part")
sarah = ddf.loc[ddf.name == "Sarah"]
kevin = ddf.loc[ddf.name == "Kevin"]
# compute both dataframes at once, so dask doesn't waste computation;
# note that client.compute returns futures, whose values are retrieved
# with .result() once they finish
sarah, kevin = client.compute([sarah, kevin])
print(sarah.result().head())
timestamp id name x y
48 2000-01-01 00:00:48 1021 Sarah 0.227437 -0.105553
53 2000-01-01 00:00:53 933 Sarah 0.400616 -0.549600
66 2000-01-01 00:01:06 1021 Sarah -0.866886 0.762793
121 2000-01-01 00:02:01 994 Sarah 0.316168 0.305508
126 2000-01-01 00:02:06 1024 Sarah 0.978563 0.647163
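Applied to your own snippet, the same pattern might look roughly like this (a sketch; normal_numerical_path and the Client setup are taken from your question):

import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=2, memory_limit='6GB')

data_frame = dd.read_csv(normal_numerical_path, blocksize=None)
data_frame = data_frame.dropna(how='all')

# build both lazy results first, without computing anything yet
bad_lazy = data_frame[data_frame['Response'] == 1].shape[0]
good_lazy = data_frame[data_frame['Response'] == 0].shape[0]

# submit them together: the shared read_csv tasks appear only once in the
# merged graph, so the file is read a single time
bad_future, good_future = client.compute([bad_lazy, good_lazy])
bad_samples = bad_future.result()
good_samples = good_future.result()

If your data did fit in memory, you could also call data_frame = data_frame.persist() once and then compute against the cached partitions as often as you like, but with a larger-than-memory dataset the client.compute approach above is the safer option.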
Upvotes: 3