Reputation: 523
I am parallelising some code using dask.distributed
(embarrassingly parallel task).
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit=8e9)
client = Client(cluster)
While the tasks run I keep getting warnings like:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.
Perhaps some other process is leaking memory? Process memory: 6.21 GB -- Worker memory limit: 8.00 GB
This suggests that part of the RAM used by the worker is not freed between the different files (I guess these are leftover filtering intermediates...).
Is there a way to free the memory of the worker before it starts processing the next image? Do I have to run a garbage-collection cycle between tasks?
I included a gc.collect() call at the end of the function run by the worker, but it didn't eliminate the warnings.
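For context, the per-file task looks roughly like this (a minimal sketch; load_image, run_filters and save_result are placeholders for my actual pipeline):

import gc

def process_file(path):
    data = load_image(path)       # placeholder: read one image from disk
    result = run_filters(data)    # placeholder: the filtering intermediates live here
    save_result(path, result)     # placeholder: write the output
    gc.collect()                  # explicit GC at the end of each task
    return path

futures = client.map(process_file, file_list)
client.gather(futures)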
Thanks a lot for the help!
Upvotes: 10
Views: 1601
Reputation: 382
The "Memory use is high" error message could be pointing to a few potential culprits. I found this article by one of the core Dask maintainers helpful in diagnosing and fixing the issue.
Quick summary: one option is to trim memory manually with malloc_trim (esp. if working with non-NumPy data or small NumPy chunks).
Make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
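As a rough sketch of the manual trimming approach (glibc/Linux only, so adapt for other platforms), you can ask every worker to call malloc_trim via client.run:

import ctypes

def trim_memory() -> int:
    # ask glibc to return freed heap memory to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# run the trim on every worker in the cluster
client.run(trim_memory)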
Upvotes: 1
Reputation: 4366
As long as a client holds a reference to a distributed value (for example a Future), the cluster won't purge it from memory. This is expounded on in the Managing Memory documentation, specifically the "Clearing data" section.
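A minimal sketch of what that means in practice (process_file and file_list are placeholder names): once the client drops its futures, the workers are free to release the corresponding results.

futures = client.map(process_file, file_list)   # placeholder task and inputs
results = client.gather(futures)                # copy results back to the client

# While `futures` is alive on the client, the scheduler keeps the results
# pinned in worker memory. Dropping the references lets the cluster purge them.
del futures
# client.cancel(futures)   # alternative: explicitly release the futures
# client.restart()         # last resort: clears all data held on the cluster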
Upvotes: 1