Reputation: 3892
Dask distributed supports work stealing, which can speed up computation and make it more robust; as a consequence, however, a task may be run more than once.
Here I am asking for a way to "tidy up" after workers that did not contribute to the final result. To illustrate:
Let's assume each worker runs Monte-Carlo-like simulations and saves its ~10 GB simulation result in a results folder. With work stealing, the same simulation result may be stored several times, so it would be desirable to keep only one copy. What would be the best way to achieve this? Can dask.distributed automatically call some "tidy up" procedure on tasks that did not end up contributing to the final result?
Edit: I currently start the simulation using the following code:
c = distributed.Client(myserver)
mytask.compute(get = c.get) #mytask is a delayed object
So I guess that afterwards all data is deleted from the cluster, and if I "look at data that exists in multiple locations" after the computation, there is no guarantee I can still find the respective tasks? Also, I currently have no clear idea how to map the ID of a future object to the filename to which the respective task saved its results. I currently rely on tempfile to avoid name collisions, which, given the Monte-Carlo setting, is by far the easiest approach.
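For reference, a minimal sketch of the tempfile pattern described above (the function names and the results-folder layout are my own illustration, not from the original post): each task writes its result to a collision-free file and returns that path, so the computed value itself tells you which file actually contributed to the final result.

```python
import os
import pickle
import tempfile

# Hypothetical results folder; in practice this would be the shared results directory.
RESULTS_DIR = tempfile.mkdtemp(prefix="mc_results_")

def save_result(result):
    """Write a simulation result to a collision-free file and return its path.

    tempfile.mkstemp guarantees a unique name even if work stealing runs
    the same task concurrently on several workers.
    """
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=RESULTS_DIR)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    return path

def simulate(seed):
    """Stand-in for a Monte-Carlo simulation step (hypothetical)."""
    return {"seed": seed, "value": seed * 0.5}

# Each task returns the path it wrote, so the final computed graph
# identifies exactly which files contributed; everything else in
# RESULTS_DIR is a leftover duplicate that can be deleted.
path = save_result(simulate(42))
```

Wrapping `save_result(simulate(seed))` in `dask.delayed` would keep this pattern unchanged on the cluster.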
Upvotes: 0
Views: 445
Reputation: 57291
There is currently no difference in Dask between a task that was stolen and a task that was not.
If you want, you can look at data that exists in multiple locations and then send commands directly to those workers.
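A hedged sketch of what that could look like (the `duplicated_keys` helper and the `tidy_up` function are my own assumptions, not part of the original answer): `Client.who_has()` maps each key to the worker addresses holding it, and `Client.run()` executes a function directly on chosen workers.

```python
def duplicated_keys(who_has):
    """Given a Client.who_has()-style mapping of key -> [worker addresses],
    return only the entries held on more than one worker."""
    return {key: workers for key, workers in who_has.items() if len(workers) > 1}

# Hypothetical usage against a live cluster (requires dask.distributed):
#
#   from dask.distributed import Client
#   c = Client(myserver)
#   dupes = duplicated_keys(c.who_has())
#   for key, workers in dupes.items():
#       # keep the copy on workers[0]; ask the others to tidy up,
#       # e.g. delete the redundant file this task wrote there
#       c.run(tidy_up, key, workers=workers[1:])  # tidy_up is your own function
#
# Pure-Python demonstration with a fake who_has mapping:
fake = {"sim-1": ["tcp://a:1"], "sim-2": ["tcp://a:1", "tcp://b:1"]}
dupes = duplicated_keys(fake)  # only sim-2 appears on two workers
```

Note that, as the answer says, this inspection must happen while the data is still on the cluster, i.e. before the futures are released.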
Upvotes: 1