Egil Möller

Reputation: 31

distributed.worker - WARNING - Heartbeat to scheduler failed

I'm running long-running dask.delayed() jobs (which use subprocess to run external binaries that process large files), and I get cancelled Futures because all workers lose communication with the scheduler:

distributed.worker - WARNING - Heartbeat to scheduler failed

The scheduler says:

distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.99s.  
This is often caused by long-running GIL-holding functions or moving large chunks of data.
This can cause timeouts and instability.

Why does this happen, and how do I work around or fix it? From my understanding, the scheduler doesn't run any of my Python code itself...

Upvotes: 1

Views: 3624

Answers (1)

MRocklin

Reputation: 57281

As the warning mentions, the main thread of the worker wasn't able to do anything for a while. This is often caused by calling compiled functions that hold onto the GIL. They grab the GIL and then disappear into compiled code for a while, not letting any other Python code (like heartbeat messages) run.

The right way to solve this problem is to have your compiled code release the GIL. If you have control over this code then it is usually an easy fix in Cython, and it is now, I think, the default in cffi. If you're just calling subprocess then I don't have a good explanation for this. Waiting on a subprocess should not hold onto the GIL.
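One way to check the subprocess claim on your own machine is a small sketch like the one below. It is not dask code: the background thread is only a stand-in for the worker's heartbeat, the child process is a stand-in for your external binary, and the helper name count_ticks_during_subprocess is made up for illustration. If the stand-in heartbeat keeps ticking while the main thread waits on the child, the wait is not holding the GIL.

```python
import subprocess
import sys
import threading
import time


def count_ticks_during_subprocess(child_seconds=0.5, tick_interval=0.05):
    """Run a child process while a background thread ticks; return the tick count."""
    ticks = 0
    stop = threading.Event()

    def heartbeat():
        nonlocal ticks
        # Stand-in for the worker heartbeat: it only advances if it can take the GIL.
        while not stop.is_set():
            ticks += 1
            time.sleep(tick_interval)

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        # Stand-in for the external binary: a child Python process that just sleeps.
        subprocess.run(
            [sys.executable, "-c", f"import time; time.sleep({child_seconds})"],
            check=True,
        )
    finally:
        # Always stop the heartbeat thread, even if the child process fails.
        stop.set()
        thread.join()
    return ticks


if __name__ == "__main__":
    print("heartbeat ticks while waiting:", count_ticks_during_subprocess())
```

If the count stays well above one, the subprocess call itself is probably not what starves the heartbeat, and it may be worth looking at what else the task does around it, for example reading large outputs back into Python.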

Upvotes: 2
