Reputation: 591
I've been using the cudnn_rnn models in TensorFlow in a single-session environment, and they run fine. However, TensorFlow crashes when I try to use CudnnLSTM in distributed runs with one PS host and several GPU workers.
from tensorflow.contrib.cudnn_rnn.python.layers import cudnn_rnn

with tf.device(tf.train.replica_device_setter(
        worker_device = "/job:worker/task:%d" % TASK_INDEX, cluster = cluster)):
    lstm = cudnn_rnn.CudnnLSTM(self.layers, self.hidden_units)

with tf.train.MonitoredTrainingSession(master = server.target,
                                       is_chief = (TASK_INDEX == 0),
                                       checkpoint_dir = CHECKPOINT_DIR,
                                       hooks = hooks) as sess:
    ...
I get the following error in one of my worker processes (that have access to GPUs):
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save/CudnnRNNCanonicalToParams': Could not satisfy explicit device specification '/job:worker/task:0/device:CPU:0' because no supported kernel for CPU devices is available.
[[Node: save/CudnnRNNCanonicalToParams = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=12, rnn_mode="gru", seed=0, seed2=0, _device="/job:worker/task:0/device:CPU:0"](save/CudnnRNNCanonicalToParams/num_layers, save/CudnnRNNCanonicalToParams/num_units, save/CudnnRNNCanonicalToParams/input_size, save/Reshape, save/Reshape_1, save/Reshape_2, save/Reshape_3, save/Reshape_4, save/Reshape_5, save/Reshape_6, save/Reshape_7, save/Reshape_8, save/Reshape_9, save/Reshape_10, save/Reshape_11, save/split_3, save/split_3:1, save/RestoreV2_22, save/split_4, save/split_4:1, save/RestoreV2_23, save/split_8, save/split_8:1, save/RestoreV2_25, save/split_9, save/split_9:1, save/RestoreV2_26)]]
I tried setting save_checkpoint_secs = None in MonitoredTrainingSession, but I still get the same error.
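For reference, this is roughly what that attempt looked like (only save_checkpoint_secs changed from the snippet above):

with tf.train.MonitoredTrainingSession(master = server.target,
                                       is_chief = (TASK_INDEX == 0),
                                       checkpoint_dir = CHECKPOINT_DIR,
                                       # intended to disable the periodic checkpoint saving
                                       save_checkpoint_secs = None,
                                       hooks = hooks) as sess:
    ...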
I have read the comments in tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py that mention saving parameters and using a PS server, but I can't find a working example. Any ideas on how to make distributed TensorFlow and CudnnLSTM work together?
Update: @Ash's answer about updating TensorFlow helped. Also, for now, I need to specify no sharding in the Saver:
with tf.train.MonitoredTrainingSession(master = server.target,
                                       is_chief = (TASK_INDEX == 0),
                                       checkpoint_dir = CHECKPOINT_DIR,
                                       scaffold = tf.train.Scaffold(
                                           saver = tf.train.Saver(sharded = False, allow_empty = True)),
                                       hooks = hooks) as sess:
Upvotes: 0
Views: 937
Reputation: 6751
I believe this is a bug that has been fixed at HEAD, but the fix is not part of any release yet, so to get it you'll have to build TensorFlow from source or otherwise incorporate the same change into your installation.
The fix is in this commit: https://github.com/tensorflow/tensorflow/commit/56da08fed6862422904411a61059b38940a57338
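If you're unsure whether your installed build already includes that commit, you can print the version and the git revision TensorFlow was built from and compare, e.g.:

import tensorflow as tf

# Release version and the git revision this build was compiled from;
# the fix is only present in builds whose history includes the commit above.
print(tf.__version__)
print(tf.__git_version__)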
Upvotes: 1