Current situation
1) The model uses a CuDNNLSTM layer (a rough sketch of the setup follows this list).
2) It was trained on time-series data with 1,000 time steps per sample, which I think is quite long.
3) Training was done on a Tesla T4 on Google Cloud Platform.
4) The trained model is loaded on my local PC and run on a GTX 1060 (6 GB) GPU.
5) When predicting with the model on the local PC, the error below does not occur every time, but it does occur sometimes.
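To make the setup concrete, here is a rough sketch of the model and the prediction call. It is simplified: the layer sizes (input_size=38, num_units=16, seq_length=1000, batch_size=1) are taken from the error log below, the output size is a guess, and the weight-file path is omitted.

import numpy as np
from keras.models import Sequential
from keras.layers import Bidirectional, CuDNNLSTM, Dense

# Simplified reconstruction of the model; the real architecture may differ.
model = Sequential([
    Bidirectional(CuDNNLSTM(16), input_shape=(1000, 38)),
    Dense(1),
])
# model.load_weights('...')  # weights trained on the T4 (path omitted here)

x = np.zeros((1, 1000, 38), dtype='float32')  # dummy input with the real shape
pred = model.predict(x, batch_size=1)  # this call sometimes fails on the GTX 1060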
I searched Google for the error message, and it appears to be a GPU memory problem.
Why I think it is a GPU memory problem:
1) Most of the reported solutions to this error message enable dynamic GPU memory allocation:

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

But it does not work for me (how I applied it is sketched after this list).
2) If I transfer the weights from the CuDNNLSTM model into a plain LSTM model, prediction works without the error, but it is very slow: model.predict is roughly 10x slower (see the weight-transfer sketch after this list).
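For point 1), this is roughly how I applied the allow_growth setting before building the model (a minimal sketch for TF 1.x with standalone Keras; my exact code may differ slightly):

import tensorflow as tf
from keras import backend as K

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))  # must run before the model is created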
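For point 2), by "changing the weights of CuDNNLSTM to an LSTM model" I mean roughly the following: rebuild the same architecture with LSTM in place of CuDNNLSTM and load the trained weights into it. This is only a sketch; recurrent_activation='sigmoid' is what makes the weights match the cuDNN implementation, and the weight-file path is again omitted.

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

# Same architecture as the sketch above, but with the portable LSTM layer.
cpu_model = Sequential([
    Bidirectional(LSTM(16, recurrent_activation='sigmoid'),
                  input_shape=(1000, 38)),
    Dense(1),
])
# cpu_model.load_weights('...')  # Keras converts the CuDNNLSTM weight layout
# pred = cpu_model.predict(x, batch_size=1)  # no error, but ~10x slower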
Question
I wonder whether this is really a GPU memory problem, and whether it could remain unsolved even if I installed a new GPU. The model itself is not that big (~100 MB), and when I checked the GPU usage (the check is sketched below), only about 1 GB of memory was in use.
What I really want to know is how to pinpoint exactly what is causing the problem.
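For reference, this is roughly how I checked the GPU memory usage while model.predict was running (via nvidia-smi; the approximate 1 GB figure above comes from a check like this):

import subprocess

# Query current GPU memory usage through nvidia-smi during prediction.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader"]
)
print(out.decode().strip())  # e.g. "1024 MiB, 6144 MiB" on the GTX 1060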
Thank you for reading this long post.
2020-03-11 14:16:53.437923: E tensorflow/stream_executor/cuda/cuda_dnn.cc:82] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1477): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-03-11 14:16:53.438538: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1224 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]
Traceback (most recent call last):
File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training.py", line 1462, in predict
callbacks=callbacks)
File "D:\Anaconda3_64\envs\gpu\lib\site-packages\keras\engine\training_arrays.py", line 324, in predict_loop
batch_outs = f(ins_batch)
File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
run_metadata=self.run_metadata)
File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "D:\Anaconda3_64\envs\gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, seq_length, batch_size]: [1, 38, 16, 1, 1000, 1]
[[{{node bidirectional_2/CudnnRNN}}]]
[[{{node dense_1/BiasAdd}}]]