Tae Hyun Jo

Reputation: 16

A question about error messages that occur when using the CuDNNLSTM layer in Keras

Current situation

1) A CuDNNLSTM layer is used in the deep-learning model architecture.

2) The model was trained on time-series data with 1,000 time steps per batch (which I think is quite long).

3) Training was done on a Tesla T4 on Google Cloud Platform.

4) The model is then loaded on a local PC and run on a GTX 1060 (6 GB) GPU.

5) When predicting with the model on the local PC, an error occurs intermittently rather than on every run.

I searched Google for the error message, and it appears to be a GPU memory problem.

Why I think it is a GPU memory problem

1) Most reported solutions to this error message involve enabling dynamic GPU memory allocation:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

But this setting does not work for me (a fuller sketch of applying it to the Keras session follows this list).

2) If the CuDNNLSTM weights are transferred to a plain LSTM model, prediction works without the error, but it is very slow: model.predict seems to be roughly 10x slower.
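For reference, here is a minimal sketch of the session setup from point 1, assuming TensorFlow 1.x with standalone Keras (as in the traceback below). The model path and the dummy input are hypothetical; the input shape simply mirrors the batch_size=1, seq_length=1000, input_size=38 reported in the error log.

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

# Enable on-demand GPU memory allocation and register the session with Keras
# before the model is loaded or predict() is called.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

model = load_model("model.h5")   # hypothetical path to the trained model
x = np.random.rand(1, 1000, 38)  # dummy batch: (batch, time steps, features), as in the error log
pred = model.predict(x, batch_size=1)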

Question

I wonder whether this is really a GPU memory problem, and whether it might remain unsolved even if a new GPU is installed. The model itself is not that big (~100 MB), and when I check GPU usage, it only takes about 1 GB of memory.

I would like to know which parts I should look into to find out exactly what is causing the problem.

Thank you for reading this long post.

2020-03-11 14:16:53.437923: E tensorflow/stream_executor/cuda/cuda_dnn.cc:82] CUDNN_STATUS_INTERNAL_ERROR

in tensorflow/stream_executor/cuda/cuda_dnn.cc(1477): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'

2020-03-11 14:16:53.438538: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1224 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]

Traceback (most recent call last):
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training.py", line 1462, in predict
    callbacks=callbacks)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training_arrays.py", line 324, in predict_loop
    batch_outs = f(ins_batch)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]
  [[{{node bidirectional_2/CudnnRNN}}]]
  [[{{node dense_1/BiasAdd}}]]

Upvotes: 0

Views: 198

Answers (1)

Zabir Al Nazi Nabil

Reputation: 11208

  1. Try reducing your batch size.
  2. It seems to be a Windows bug; if you are running on Windows, try Ubuntu and see whether the error is still there. https://github.com/tensorflow/tensorflow/issues/33924
  3. Try running on the CPU (tensorflow-cpu); the bug should go away (see the sketch below).
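One caveat for point 3: the CuDNNLSTM layer only has a GPU kernel, so to run on the CPU the model must be rebuilt with plain LSTM layers and the trained weights loaded into it. Below is a minimal sketch; the architecture, activation choice, and file name are assumptions loosely inferred from the error log, not the asker's actual code. Recent standalone Keras versions attempt to convert CuDNN RNN weights automatically when loading HDF5 weights into a compatible LSTM.

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

# Rebuild the same topology with LSTM instead of CuDNNLSTM.
# recurrent_activation="sigmoid" matches what the CuDNN kernel computes,
# so the converted weights should give equivalent outputs.
cpu_model = Sequential([
    Bidirectional(LSTM(16, recurrent_activation="sigmoid", return_sequences=True),
                  input_shape=(1000, 38)),
    Dense(1),
])
cpu_model.load_weights("cudnn_lstm_weights.h5")  # hypothetical weights file saved from the CuDNNLSTM model

# Smaller prediction batches (point 1) can be tried the same way:
# preds = cpu_model.predict(x, batch_size=1)

To guarantee CPU execution even when a GPU is present, either install tensorflow-cpu as suggested or hide the GPU by setting the CUDA_VISIBLE_DEVICES environment variable to "-1" before importing tensorflow.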

Upvotes: 1
