Tae Hyun Jo

Reputation: 16

A question about error messages that occur when using the CuDNNLSTM layer in Keras

Current situation

1) A CuDNNLSTM layer is used in the deep-learning model architecture.

2) The model was trained on time-series data with 1,000 time steps per batch (which I think is quite long).

3) Training was done on a Tesla T4 on Google Cloud Platform.

4) The model is then loaded on a local PC and run on a GTX 1060 (6 GB) GPU.

5) When predicting with the model on the local PC, an error occurs intermittently rather than on every run.

I searched Google for the error message, and it appears to be a GPU memory problem.

Why I think it is a GPU memory problem

1) Most reported solutions to this error message involve enabling dynamic GPU memory allocation:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

But this setting does not work for me (a fuller sketch of applying it to the Keras session follows this list).

2) If the CuDNNLSTM weights are transferred to a plain LSTM model, prediction works without the error, but it is very slow: model.predict seems to be roughly 10x slower.
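For reference, here is a minimal sketch of the session setup from point 1, assuming TensorFlow 1.x with standalone Keras (as in the traceback below). The model path and the dummy input are hypothetical; the input shape simply mirrors the batch_size=1, seq_length=1000, input_size=38 reported in the error log.

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

# Enable on-demand GPU memory allocation and register the session with Keras
# before the model is loaded or predict() is called.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

model = load_model("model.h5")   # hypothetical path to the trained model
x = np.random.rand(1, 1000, 38)  # dummy batch: (batch, time steps, features), as in the error log
pred = model.predict(x, batch_size=1)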

Question

I wonder whether this is really a GPU memory problem, and whether it might remain unsolved even if a new GPU is installed. The model itself is not that big (~100 MB), and when I check GPU usage, it only takes about 1 GB of memory.

I would like to know which parts I should look into to find out exactly what is causing the problem.

Thank you for reading this long post.

2020-03-11 14:16:53.437923: E tensorflow/stream_executor/cuda/cuda_dnn.cc:82] CUDNN_STATUS_INTERNAL_ERROR

in tensorflow/stream_executor/cuda/cuda_dnn.cc(1477): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'

2020-03-11 14:16:53.438538: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1224 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]

Traceback (most recent call last):
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training.py", line 1462, in predict
    callbacks=callbacks)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training_arrays.py", line 324, in predict_loop
    batch_outs = f(ins_batch)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]
  [[{{node bidirectional_2/CudnnRNN}}]]
  [[{{node dense_1/BiasAdd}}]]

Upvotes: 0

Views: 198

Answers (1)

Zabir Al Nazi Nabil

Reputation: 11208

  1. Try reducing your batch size.
  2. It seems to be a Windows bug; if you are running on Windows, try Ubuntu and see whether the error is still there. https://github.com/tensorflow/tensorflow/issues/33924
  3. Try running on the CPU (tensorflow-cpu); the bug should go away (see the sketch below).
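One caveat for point 3: the CuDNNLSTM layer only has a GPU kernel, so to run on the CPU the model must be rebuilt with plain LSTM layers and the trained weights loaded into it. Below is a minimal sketch; the architecture, activation choice, and file name are assumptions loosely inferred from the error log, not the asker's actual code. Recent standalone Keras versions attempt to convert CuDNN RNN weights automatically when loading HDF5 weights into a compatible LSTM.

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

# Rebuild the same topology with LSTM instead of CuDNNLSTM.
# recurrent_activation="sigmoid" matches what the CuDNN kernel computes,
# so the converted weights should give equivalent outputs.
cpu_model = Sequential([
    Bidirectional(LSTM(16, recurrent_activation="sigmoid", return_sequences=True),
                  input_shape=(1000, 38)),
    Dense(1),
])
cpu_model.load_weights("cudnn_lstm_weights.h5")  # hypothetical weights file saved from the CuDNNLSTM model

# Smaller prediction batches (point 1) can be tried the same way:
# preds = cpu_model.predict(x, batch_size=1)

To guarantee CPU execution even when a GPU is present, either install tensorflow-cpu as suggested or hide the GPU by setting the CUDA_VISIBLE_DEVICES environment variable to "-1" before importing tensorflow.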

Upvotes: 1
