Mark Horvath
Mark Horvath

Reputation: 1166

Does dask dataframe.persist() keep results for the next query?

I'm trying to understand how df.persist() works in dask. Would I build the same expression again, would it recalculate it or load it from cache?

E.g. what happens when I do:

ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.sum().compute())
del ddf
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())

Does dask read the .csv and shift by one twice, or the second time it comes from cache? Do I need the second .persist()? If it keeps it in cache, how do I force cleaning the cache?

Upvotes: 4

Views: 3041

Answers (1)

MRocklin
MRocklin

Reputation: 57271

When you call persist it keeps the data in distributed memory so that you don't need to recompute that part of the computation a second time.

You can release the memory by deleting the collection, as you do in line 3.

If you delete the collection, then yes, you will need to re-persist the intermediate results.

https://distributed.dask.org/en/latest/memory.html

Upvotes: 5

Related Questions