Zelazny7

Reputation: 40618

Keep intermediate DataFrame computations using dask

Is there a way to instruct dask to keep intermediate values when performing expensive computations?

In the example below, I would like dask to preserve the intermediate column d['c'] created when calculating d['d'].

import dask.dataframe as ddf

## very large file
d = ddf.read_csv("F:/tmp.csv")

d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1


## first call
%timeit d['d'].value_counts().compute()

## second call takes roughly the same time
%timeit d['d'].value_counts().compute()

Yet in my experiments, dask seems to recompute d['c'] on every call. Is there a way to tell dask to keep d['c'] around somewhere? What is the best practice for this kind of workflow? I plan on creating many intermediate columns to use in subsequent computations and don't want to recalculate them from scratch each time. Or is my understanding completely wrong?

Upvotes: 3

Views: 279

Answers (1)

MRocklin

Reputation: 57251

Compute multiple results at the same time

You can call compute on many things at once so that they share intermediate results:

import dask

dask.compute(d.min(), d.max())
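For example, applied to the question's workflow (a sketch, assuming the same columns a and b), passing two reductions over d['d'] to a single dask.compute call means the intermediate column d['c'] is only built once:

import dask
import dask.dataframe as ddf

d = ddf.read_csv("F:/tmp.csv")
d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1

## both results come from one pass over the data,
## so d['c'] is computed a single time
counts, total = dask.compute(d['d'].value_counts(), d['d'].sum())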

Use persist to keep data in memory

You can use the .persist() method or the dask.persist(...) function to compute results while keeping them as dask collections:

d['c'] = d['a'] * d['b']
d['d'] = (d['c'] + 1).persist()

or

d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1
d = d.persist()
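Applied to the question's workflow, a minimal sketch (assuming the data fits in memory, or that you are using the distributed scheduler): persist triggers the computation once and keeps the results around, so repeated calls no longer rebuild d['c'].

import dask.dataframe as ddf

d = ddf.read_csv("F:/tmp.csv")
d['c'] = d['a'] * d['b']
d['d'] = d['c'] + 1
d = d.persist()   # computes once; results are kept as dask collections

## both calls now reuse the persisted partitions
%timeit d['d'].value_counts().compute()
%timeit d['d'].value_counts().compute()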

Opportunistic caching

If you are using the first-generation single-machine scheduler then you can use opportunistic caching. See http://dask.pydata.org/en/latest/caching.html for more information.
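A minimal sketch of turning on opportunistic caching, assuming the cachey library is installed (dask.cache.Cache wraps it):

from dask.cache import Cache

cache = Cache(2e9)  # keep up to ~2 GB of useful intermediate results
cache.register()    # apply the cache to all dask computations globally

## intermediates such as d['c'] may now be reused across calls
%timeit d['d'].value_counts().compute()
%timeit d['d'].value_counts().compute()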

Upvotes: 5
