MetaStack

Reputation: 3716

Fault tolerance in Dask dependency graphs

I have a small cluster upon which I deploy a dask graph using:

from dask.distributed import Client
...
client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)
client.get(workflow, final_node)

During the workflow a bunch of tasks run in parallel, as expected. Sometimes, however, there's an error in a module that one worker is running. As soon as that module fails, the failure is reported to the scheduler, and the scheduler then stops the other workers running in parallel (even those with no dependency on the failed task). It stops them midstream.

Is there any way to let the others run to completion, and only then fail, instead of shutting them down immediately?

Upvotes: 1

Views: 183

Answers (1)

MRocklin

Reputation: 57319

The Client.get function is all-or-nothing. You should probably look at the futures interface instead. With futures you launch many computations that happen to depend on each other, and the ones that can finish will finish.

See https://docs.dask.org/en/latest/futures.html
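For example, here is a minimal sketch of the futures approach (the load/process functions and the simulated failure are hypothetical stand-ins for the modules in your workflow):

from dask.distributed import Client, as_completed

client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)

# Hypothetical stand-ins for the workflow's modules.
def load(i):
    return i

def process(x):
    if x == 2:
        raise ValueError("simulated module failure")
    return x * 10

# Submit each task individually; passing a future as an argument
# creates the dependency between tasks.
loads = [client.submit(load, i) for i in range(4)]
results = [client.submit(process, f) for f in loads]

# Tasks that don't depend on the failed one run to completion;
# only the failed branch errors out.
for future in as_completed(results):
    if future.status == 'error':
        print('failed:', future.exception())
    else:
        print('done:', future.result())

Note that client.gather(results) would raise on the first error, so iterating with as_completed (or calling gather with errors='skip') is what lets the independent branches finish.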

Upvotes: 1
