Alex Knapp

Reputation: 86

Tensorflow training causing UnknownError: CUDNN_STATUS_BAD_PARAM cuda_dnn.cc(1591)

I've fixed my error and will answer this myself at the time of posting. Given the nature and rarity of UnknownErrors, this may help other users.

I've been training a TensorFlow model recently and chose to use the train_on_batch method, as that seemed appropriate for my code. I've run into an error in between epochs.

UnknownError                              Traceback (most recent call last)
<ipython-input-30-b81516b8d970> in <module>
     37             upper = lower + batch_sizing
     38             for _ in range(epochs):
---> 39                 model.train_on_batch(x = [input1[lower:upper], input2[lower:upper]],
     40                                                  y = output[lower:upper])
     41         lower_lim += split_size

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
   1898                                                     class_weight)
   1899       self.train_function = self.make_train_function()
-> 1900       logs = self.train_function(iterator)
   1901 
   1902     logs = tf_utils.sync_to_numpy_or_python_type(logs)

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\util\traceback_utils.py in error_handler(*args, **kwargs)
    151     except Exception as e:
    152       filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153       raise e.with_traceback(filtered_tb) from None
    154     finally:
    155       del filtered_tb

c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     56   try:
     57     ctx.ensure_initialized()
---> 58     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     59                                         inputs, attrs, num_outputs)
     60   except core._NotOkStatusException as e:

UnknownError:    CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1591): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
     [[{{node CudnnRNN}}]]
     [[model/lstm/PartitionedCall]] [Op:__inference_train_function_187281]

Function call stack:
train_function -> train_function -> train_function

I've seen no signs of GPU/CPU/RAM usage going through the roof.

Tensorflow: 2.7.0 GPU: GeForce RTX 3080

I tried turning off eager execution, but that caused further errors.

Sadly, because the error is so generic, there isn't much help available for this issue.

Upvotes: 0

Views: 593

Answers (1)

Alex Knapp

Reputation: 86

The error here actually seems to come from training on an empty batch.

The loop defining lower looked like this:

for lower in range(lower_lim, upper_lim, batch_sizing):

lower_lim and upper_lim were used as split indices: I was splitting my dataset into chunks of split_size, so both were multiples of split_size, with upper_lim one split_size above lower_lim.

The problem is that after the first pass, lower_lim is incremented to what upper_lim was. lower therefore starts at split_size, so indexing the split begins at its very end (lower = lower_lim = split_size) and the resulting slice is empty.
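To make the effect concrete, here is a minimal sketch that uses a plain Python list as a stand-in for one split of the data. The values of split_size and batch_sizing and the helper batches are illustrative assumptions, not taken from my actual code; NumPy arrays slice the same way.

```python
split_size = 6      # illustrative values, not from the original code
batch_sizing = 2
split = list(range(split_size))  # stand-in for one split of input1 / input2 / output

def batches(start, stop):
    """Slice `split` as the loop does: for lower in range(start, stop, batch_sizing)."""
    return [split[lower:lower + batch_sizing]
            for lower in range(start, stop, batch_sizing)]

first_pass = batches(0, split_size)                # lower_lim = 0
second_pass = batches(split_size, 2 * split_size)  # lower_lim advanced by split_size

print(first_pass)   # [[0, 1], [2, 3], [4, 5]]
print(second_pass)  # [[], [], []]
```

Slicing past the end of a sequence does not raise an error, it just returns an empty result, which is why nothing failed until the empty batch reached cuDNN.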

The simple fix is to do:

for lower in range(0, split_size, batch_sizing):
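As an extra safeguard, it may be worth checking for empty batches before they reach the GPU, so that a mistake like this fails with a readable message instead of a cuDNN error. The sketch below is only one way to do it: checked_train_step is a hypothetical helper, and train_step is assumed to be any callable that accepts x and y keyword arguments the way model.train_on_batch does.

```python
def checked_train_step(train_step, inputs, targets):
    """Run train_step(x=inputs, y=targets), refusing empty batches.

    train_step: a callable taking x and y keyword arguments
                (for example, model.train_on_batch).
    inputs:     list of input sequences/arrays, one per model input.
    targets:    sequence/array of expected outputs.
    """
    if len(targets) == 0 or any(len(part) == 0 for part in inputs):
        raise ValueError("empty batch: check the slice bounds (lower, upper)")
    return train_step(x=inputs, y=targets)


# Usage with a dummy step function standing in for model.train_on_batch:
def dummy_step(x, y):
    return len(y)

print(checked_train_step(dummy_step, [[1, 2], [3, 4]], [0, 1]))  # 2

try:
    checked_train_step(dummy_step, [[], []], [])
except ValueError as err:
    print(err)  # empty batch: check the slice bounds (lower, upper)
```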

Upvotes: 1
