Reputation: 40618
Is there a way to instruct dask to keep intermediate values when performing expensive computations?
In the example below, I would like dask to preserve the intermediate column d['c'] created when calculating d['d'].
import dask.dataframe as ddf

## very large file
d = ddf.read_csv("F:/tmp.csv")
d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1
## first call
%timeit d['d'].value_counts().compute()
## second call takes roughly the same time
%timeit d['d'].value_counts().compute()
Yet in my experiments, it seems to be recomputing d['c'] each time. Is there a way to tell dask to keep d['c'] hanging around somewhere? What is the best practice for this kind of workflow? I plan on creating a lot of intermediate columns to use in many subsequent computations and don't want to calculate them from scratch each time. Or is my understanding completely wrong?
Upvotes: 3
Views: 279
Reputation: 57251
You can call compute on many things at the same time to share intermediate results:
dask.compute(d.min(), d.max())
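For instance (a sketch reusing d from the question; the sum is just an arbitrary second result), a single dask.compute call evaluates the shared d['c'] column only once:

import dask

## both results depend on d['c'], which is a shared node in the
## task graph, so it is computed a single time
counts, total = dask.compute(d['d'].value_counts(), d['d'].sum())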
You can use the .persist() method or the dask.persist(...) function to compute results but keep them as dask collections:
d['c'] = d['a'] * d['b']
d['d'] = (d['c'] + 1).persist()
or
d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1
d = d.persist()
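Either way, persist triggers the computation and keeps the results in memory as dask collections, so the two %timeit calls from the question should now be fast: value_counts() reuses the persisted partitions instead of re-reading the CSV and recomputing d['c']. This assumes the persisted data fits in memory (or in your cluster's memory, if you are using the distributed scheduler).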
If you are using the first-generation single-machine scheduler, then you can use opportunistic caching. See http://dask.pydata.org/en/latest/caching.html for more information.
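A minimal sketch of opportunistic caching, assuming the optional cachey dependency is installed (the 2e9-byte limit is an arbitrary example value):

from dask.cache import Cache

## hold up to ~2 GB of frequently reused intermediate results
cache = Cache(2e9)
## activate the cache globally for subsequent computations
cache.register()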
Upvotes: 5