TensorFlow imprecise timeouts

Question

I've been testing the timeout functionality for sess.run calls (applied to a convolutional neural network), and the timeouts don't seem to be very precise.

For example, if I set the timeout to 800 ms, there might be a 1-2 second delay before the timeout exception is triggered. This leads me to believe that cancellation notifications aren't caught between computational nodes (which, according to the timeline, take 0.2-0.5 s each).

So

1) Is there a way to make the timeouts more precise?

2) Are TensorFlow cancellation notifications caught between node computations?

mrry · Accepted Answer

The cancellation and timeout mechanism in TensorFlow was only designed to cancel a small number of blocking operations, in particular: dequeuing from an empty queue, enqueuing to a full queue, and reading from a file.
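As an illustration of the blocking case, here is a minimal sketch that dequeues from an empty queue with a per-call timeout. It assumes a TensorFlow installation that provides the `tf.compat.v1` session API; the function name `run_dequeue_with_timeout` and the 500 ms default are made up for this example.

```python
import time

import tensorflow as tf


def run_dequeue_with_timeout(timeout_ms=500):
    """Dequeue from an empty queue and report whether and when the timeout fired.

    Hypothetical helper for illustration only.
    """
    graph = tf.Graph()
    with graph.as_default():
        queue = tf.queue.FIFOQueue(capacity=1, dtypes=[tf.float32])
        # Nothing is ever enqueued, so this op blocks until it is cancelled.
        dequeue_op = queue.dequeue()

    # Per-call timeout, passed to sess.run() through RunOptions.
    options = tf.compat.v1.RunOptions(timeout_in_ms=timeout_ms)

    with tf.compat.v1.Session(graph=graph) as sess:
        start = time.monotonic()
        try:
            sess.run(dequeue_op, options=options)
            timed_out = False
        except tf.errors.DeadlineExceededError:
            timed_out = True
        elapsed = time.monotonic() - start

    return timed_out, elapsed
```

Calling `run_dequeue_with_timeout(800)` should report a timeout after roughly the requested 800 ms, because a blocked dequeue is one of the operations the cancellation mechanism was designed for.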

If you run a graph containing non-blocking operations, such as tf.matmul() and tf.nn.conv2d(), and the timeout expires, TensorFlow will typically wait until these operations have completed before returning with a "deadline exceeded" error.

Why is this the case? We added cancellation because users started to build pipelines of blocking operations into their graphs (e.g. for reading data) and some form of cancellation was needed to shut down these pipelines cleanly. Timeouts also help to debug deadlocks that can unfortunately occur in these pipelines. By contrast, TensorFlow is designed to dispatch non-blocking operations as efficiently as possible: for example, when running on a GPU, TensorFlow will asynchronously enqueue multiple operations on the GPU compute stream without blocking on their completion. Although it would technically be possible to check for cancellation between the execution of each operation, this would add latency to operation dispatch, and reduce overall performance in the common case.
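The difference between the two kinds of operation can be sketched with a toy model in plain Python. This is not TensorFlow's executor: `blocking_dequeue`, `compute_op`, and `run_step` are hypothetical names, and the model only assumes that blocking operations watch a cancellation flag while compute operations run to completion once dispatched.

```python
import queue
import threading
import time


def blocking_dequeue(q, cancelled, poll_s=0.01):
    """Toy blocking op: waits for an item, but gives up once cancelled."""
    while not cancelled.is_set():
        try:
            return q.get(timeout=poll_s)
        except queue.Empty:
            continue
    raise TimeoutError("deadline exceeded")


def compute_op(duration_s):
    """Toy non-blocking op: never looks at the cancellation flag."""
    time.sleep(duration_s)


def run_step(ops, timeout_s):
    """Run ops in order and return (status, seconds elapsed)."""
    cancelled = threading.Event()
    timer = threading.Timer(timeout_s, cancelled.set)
    start = time.monotonic()
    timer.start()
    try:
        for op in ops:
            op(cancelled)
        # The expired deadline is only noticed after all ops have finished.
        status = "deadline exceeded" if cancelled.is_set() else "ok"
    except TimeoutError:
        status = "deadline exceeded"
    finally:
        timer.cancel()
    return status, time.monotonic() - start


# Three compute ops of 0.3 s each with a 0.2 s timeout: error arrives late.
compute_ops = [lambda c: compute_op(0.3)] * 3
# A dequeue from an empty queue with the same timeout: error arrives promptly.
blocking_ops = [lambda c: blocking_dequeue(queue.Queue(), c)]
```

In this model, `run_step(compute_ops, 0.2)` reports "deadline exceeded" only after about 0.9 s, while `run_step(blocking_ops, 0.2)` reports it after about 0.2 s, which mirrors the kind of delay described in the question.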

However, if timeouts/cancellation for non-blocking operations would be useful for your use case, please feel free to open a GitHub issue as a feature request!
