Reputation: 18141
In my dask-based application (using the distributed
scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed
manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does after timeout
indicate that the task has taken too long, or is there some other timeout that is triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask
or distributed
) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Upvotes: 5
Views: 4479
Reputation: 57261
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly the code was that failed. Ideally this points to a line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
Upvotes: 2