Reputation: 1166
I'm trying to understand how df.persist() works in dask. If I built the same expression again, would it recalculate it or load it from cache?
E.g. what happens when I do:
import dask.dataframe

ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.sum().compute())
del ddf
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())
Does dask read the .csv and shift by one twice, or does the result come from cache the second time? Do I need the second .persist()? If it is kept in cache, how do I force the cache to be cleared?
Upvotes: 4
Views: 3041
Reputation: 57271
When you call persist, it keeps the data in distributed memory so that you don't need to recompute that part of the computation a second time.
You can release the memory by deleting the collection, as you do with del ddf in your example.
If you delete the collection, then yes, you will need to re-persist the intermediate results.
https://distributed.dask.org/en/latest/memory.html
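A minimal sketch of that behaviour (the Client and local cluster here are an assumption for illustration; the question doesn't say which scheduler is used, and with the default single-machine scheduler persist() instead computes the result eagerly into local memory):

from dask.distributed import Client
import dask.dataframe

client = Client()  # assumed local cluster, so persist() holds partitions in worker memory

ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()  # graph runs once here

print(ddf.sum().compute())   # reuses the persisted partitions
print(ddf.mean().compute())  # also reuses them; the CSV is not read again

del ddf  # dropping the last reference lets the scheduler free that memory

# rebuilding the same expression afterwards starts from scratch,
# so it has to be persisted again to get the in-memory copy back
ddf = dask.dataframe.read_csv('my.csv').shift(1).persist()
print(ddf.mean().compute())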
Upvotes: 5