distributed.worker - WARNING - Heartbeat to scheduler failed

Question

I'm running long-running dask.delayed() jobs (which use subprocess to run external binaries that process large files), and my Futures get cancelled because all workers lose communication with the scheduler:
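A minimal sketch of this kind of setup, for illustration only. It assumes a local `Client()` and uses a hypothetical helper `run_external`, with a trivial Python one-liner standing in for the real external binaries and large files, which are not shown here:

```python
import subprocess
import sys


def run_external(args):
    """Run an external command and return its stdout (hypothetical helper).

    subprocess.run blocks the calling worker thread until the command exits.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def main():
    # Imported here so run_external can be tried without dask installed.
    import dask
    from dask.distributed import Client

    client = Client()  # assumption: a local cluster; the real deployment may differ

    # Stand-in commands; the real jobs call external binaries on large files.
    tasks = [
        dask.delayed(run_external)([sys.executable, "-c", f"print({i})"])
        for i in range(4)
    ]
    futures = client.compute(tasks)
    print(client.gather(futures))
    client.close()


if __name__ == "__main__":
    main()
```

Each task only launches a subprocess and waits for it, so little Python code runs in the worker while the binary is working.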

distributed.worker - WARNING - Heartbeat to scheduler failed

The scheduler says:

distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.99s.  
This is often caused by long-running GIL-holding functions or moving large chunks of data.
This can cause timeouts and instability.

Why does this happen, and how do I work around or fix it? From my understanding, the scheduler doesn't run any of my Python code itself...

Answers (1)
