TensorFlow training causing UnknownError: CUDNN_STATUS_BAD_PARAM cuda_dnn.cc(1591)

Question

I've already fixed this error and am posting my own answer alongside the question. Because UnknownErrors are rare and hard to diagnose, this may help other users.

I've been training a TensorFlow model recently and chose the train_on_batch method, as it seemed appropriate for my code. I've run into an error between epochs.

UnknownError                              Traceback (most recent call last)
 in 
     37             upper = lower + batch_sizing
     38             for _ in range(epochs):
---> 39                 model.train_on_batch(x = [input1[lower:upper], input2[lower:upper]],
     40                                                  y = output[lower:upper])
     41         lower_lim += split_size

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
   1898                                                     class_weight)
   1899       self.train_function = self.make_train_function()
-> 1900       logs = self.train_function(iterator)
   1901 
   1902     logs = tf_utils.sync_to_numpy_or_python_type(logs)

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\util\traceback_utils.py in error_handler(*args, **kwargs)
    151     except Exception as e:
    152       filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153       raise e.with_traceback(filtered_tb) from None
    154     finally:
    155       del filtered_tb

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     56   try:
     57     ctx.ensure_initialized()
---> 58     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     59                                         inputs, attrs, num_outputs)
     60   except core._NotOkStatusException as e:

UnknownError:    CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1591): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
     [[{{node CudnnRNN}}]]
     [[model/lstm/PartitionedCall]] [Op:__inference_train_function_187281]

Function call stack:
train_function -> train_function -> train_function
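
The traceback fragment slices each input with `lower:upper` before calling `train_on_batch`. In Python, a slice that starts past the end of an array is silently empty rather than an error, so one thing that may be worth ruling out is a zero-length batch reaching the LSTM layer. This is an assumption, not something the traceback confirms. Below is a minimal sketch of a bounds helper; `batch_bounds`, `n_samples`, and `batch_size` are hypothetical names that do not appear in the original code.

```python
def batch_bounds(n_samples, batch_size):
    """Yield (lower, upper) index pairs covering n_samples with no empty slice.

    Hypothetical helper for illustration; not part of the original code.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for lower in range(0, n_samples, batch_size):
        # Clamp the upper bound so the final slice is short rather than empty.
        upper = min(lower + batch_size, n_samples)
        yield lower, upper


# Hypothetical usage: 10 samples in batches of 4.
bounds = list(batch_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

Each pair could then be used for slices such as `input1[lower:upper]` in the loop shown in the traceback. Checking that `upper > lower` before calling `train_on_batch` would skip any empty slice.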

I've seen no sign of GPU, CPU, or RAM usage spiking.

TensorFlow: 2.7.0
GPU: GeForce RTX 3080

I tried turning off eager execution, but that caused further errors.

Sadly, due to the nature of the error, there isn't much help available for this issue.

