Reputation: 86
I've fixed my error and will answer this myself at the time of posting; given the nature and rarity of UnknownErrors, this may help other users.
I've been training a TensorFlow model recently and chose to use the train_on_batch method, as that seemed appropriate for my code. I then stumbled across an error between epochs.
UnknownError Traceback (most recent call last)
<ipython-input-30-b81516b8d970> in <module>
37 upper = lower + batch_sizing
38 for _ in range(epochs):
---> 39 model.train_on_batch(x = [input1[lower:upper], input2[lower:upper]],
40 y = output[lower:upper])
41 lower_lim += split_size
c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
1898 class_weight)
1899 self.train_function = self.make_train_function()
-> 1900 logs = self.train_function(iterator)
1901
1902 logs = tf_utils.sync_to_numpy_or_python_type(logs)
c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\util\traceback_utils.py in error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
c:\users\alex\appdata\local\programs\python\python39\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
56 try:
57 ctx.ensure_initialized()
---> 58 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
59 inputs, attrs, num_outputs)
60 except core._NotOkStatusException as e:
UnknownError: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1591): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node CudnnRNN}}]]
[[model/lstm/PartitionedCall]] [Op:__inference_train_function_187281]
Function call stack:
train_function -> train_function -> train_function
I've seen no signs of GPU/CPU/RAM usage going through the roof.
TensorFlow: 2.7.0
GPU: GeForce RTX 3080
I have tried turning off eager execution, though that only caused further errors.
Sadly, due to the nature of the error, there isn't much help available for this issue.
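For context, the loop around the failing call looked roughly like this (reconstructed from the traceback above; the model and arrays below are placeholder stand-ins, not my actual setup, so the structure is runnable without TensorFlow):

```python
import numpy as np

# Placeholder model; the real call was Keras's model.train_on_batch,
# which raised the UnknownError shown in the traceback.
class DummyModel:
    def train_on_batch(self, x, y):
        return float(len(y))  # stand-in for a returned loss value

model = DummyModel()
epochs, batch_sizing, split_size = 2, 2, 4
input1 = input2 = output = np.arange(split_size)  # illustrative data

lower_lim, upper_lim = 0, split_size
batch_sizes = []
for lower in range(lower_lim, upper_lim, batch_sizing):
    upper = lower + batch_sizing
    for _ in range(epochs):
        model.train_on_batch(x=[input1[lower:upper], input2[lower:upper]],
                             y=output[lower:upper])
    batch_sizes.append(upper - lower)
```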
Upvotes: 0
Views: 593
Reputation: 86
The error here actually seems to come from training on an empty batch.
The loop defining lower looked like this:
for lower in range(lower_lim, upper_lim, batch_sizing):
lower_lim and upper_lim were used as split indices: I was attempting to split my dataset into chunks of split_size, so lower_lim and upper_lim were multiples of split_size, with upper_lim one split_size above lower_lim.
The issue this caused is that after the first split, lower_lim had been incremented to what upper_lim was, so lower began at split_size. But I was then slicing into the split itself, which only contains split_size items, so slicing from lower = lower_lim = split_size onward ran past its end and produced empty batches.
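A minimal demonstration of the indexing problem, using NumPy as a stand-in for the actual data (the names mirror my loop, the values are illustrative):

```python
import numpy as np

split_size = 4
batch_sizing = 2

data = np.arange(10)
lower_lim = split_size  # after the first split, lower_lim has advanced to split_size
split = data[lower_lim:lower_lim + split_size]  # the current split: split_size items

# The bug: iterating from lower_lim while slicing the *split*, whose valid
# indices are only 0..split_size-1, so every slice falls past its end.
batch_lengths = []
for lower in range(lower_lim, lower_lim + split_size, batch_sizing):
    batch = split[lower:lower + batch_sizing]
    batch_lengths.append(len(batch))

print(batch_lengths)  # every "batch" here is empty
```

NumPy silently returns an empty array for an out-of-range slice, which is why the empty batch only surfaced later as a cuDNN error rather than an IndexError.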
The simple fix is to do:
for lower in range(0, split_size, batch_sizing):
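With the indices relative to the split itself, each batch is non-empty (again using a NumPy placeholder for the split):

```python
import numpy as np

split_size = 4
batch_sizing = 2
split = np.arange(split_size)  # stand-in for one split of the dataset

batch_lengths = []
for lower in range(0, split_size, batch_sizing):
    upper = lower + batch_sizing
    batch_lengths.append(len(split[lower:upper]))

print(batch_lengths)  # each batch now holds batch_sizing items
```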
Upvotes: 1