MetaStack

Reputation: 3716

Fault tolerance in Dask dependency graphs

I have a small cluster upon which I deploy a dask graph using:

from dask.distributed import Client
...
client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)
client.get(workflow, final_node)

During the workflow a bunch of tasks run in parallel, as expected. Sometimes, however, there's an error in a module that one worker is running. As soon as that module fails, the failure is reported to the scheduler, and the scheduler then stops the other workers running in parallel (even those with no dependency on the failed task). It stops them midstream.

Is there any way to let the others run to completion, and only then fail, instead of shutting them down immediately?

Upvotes: 1

Views: 183

Answers (1)

MRocklin

Reputation: 57319

The Client.get function is all-or-nothing. You should probably look at the futures interface instead. With futures you launch many computations that happen to depend on each other, and the ones that can finish will finish.

See https://docs.dask.org/en/latest/futures.html
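For example, here is a minimal sketch of the futures approach (the load/process functions and the simulated failure are hypothetical stand-ins for the modules in your workflow):

from dask.distributed import Client, as_completed

client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)

# Hypothetical stand-ins for the workflow's modules.
def load(i):
    return i

def process(x):
    if x == 2:
        raise ValueError("simulated module failure")
    return x * 10

# Submit each task individually; passing a future as an argument
# creates the dependency between tasks.
loads = [client.submit(load, i) for i in range(4)]
results = [client.submit(process, f) for f in loads]

# Tasks that don't depend on the failed one run to completion;
# only the failed branch errors out.
for future in as_completed(results):
    if future.status == 'error':
        print('failed:', future.exception())
    else:
        print('done:', future.result())

Note that client.gather(results) would raise on the first error, so iterating with as_completed (or calling gather with errors='skip') is what lets the independent branches finish.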

Upvotes: 1
