Reputation: 4050
While working with Keras and Jupyter Notebook, I occasionally get an error (see below for the entire error log) once I start training a model. Although "Failed to get convolution algorithm. This is probably because cuDNN failed to initialize" suggests that this is related to a version conflict, that does not seem to apply in my case: my versions appear to be working, since I am able to run the training procedure just fine most of the time. However, once I get this error I have to close all running Python processes and restart Anaconda in order to proceed without errors.
Since restarting Anaconda each time this error occurs is very inconvenient, I wonder whether there is a fix, or any explanation of why this error occurs other than a version conflict?
This is the entire error I am getting:
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
<ipython-input-23-5d485feb54c5> in <module>
1 K.clear_session()
2 model_all = define_model(train_data)
----> 3 model_all = train_bild(train_generator_all,validation_generator_all, model_all)
4 model_all.save(subdir+cat+"/"+cat+"_model_all_inception.h5")
5
<ipython-input-17-afb528e9309d> in train_bild(train_generator, validation_generator, model)
25 epochs=num_epochs,
26 validation_data=validation_generator,
---> 27 validation_steps=VALID_STEPS, workers=16,callbacks=[checker,early, reduce_lr],class_weight=class_weights)#,class_weight=class_weights)
28
29 model = load_model(filepath)
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1416 use_multiprocessing=use_multiprocessing,
1417 shuffle=shuffle,
-> 1418 initial_epoch=initial_epoch)
1419
1420 @interfaces.legacy_generator_methods_support
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
215 outs = model.train_on_batch(x, y,
216 sample_weight=sample_weight,
--> 217 class_weight=class_weight)
218
219 outs = to_list(outs)
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1215 ins = x + y + sample_weights
1216 self._make_train_function()
-> 1217 outputs = self.train_function(ins)
1218 return unpack_singleton(outputs)
1219
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
2673 fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
2674 else:
-> 2675 fetched = self._callable_fn(*array_vals)
2676 return fetched[:len(self.outputs)]
2677
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py in __call__(self, *args, **kwargs)
1437 ret = tf_session.TF_SessionRunCallable(
1438 self._session._session, self._handle, args, status,
-> 1439 run_metadata_ptr)
1440 if run_metadata:
1441 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
526 None, None,
527 compat.as_text(c_api.TF_Message(self.status.status)),
--> 528 c_api.TF_GetCode(self.status.status))
529 # Delete the underlying status object from memory otherwise it stays alive
530 # as there is a reference to status from this from the traceback due to
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond_1/FusedBatchNorm/Switch"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv2d_1/convolution-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node loss/mul/_4005}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4855_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Upvotes: 1
Views: 2185
Reputation: 1602
The mysterious code (linked by Mastiff in the comments) from https://github.com/tensorflow/tensorflow/issues/24828 is this:
# python 3.6 and tensorflow (both 1.x and 2.0)
import tensorflow as tf


def allow_gpu_memory_growth(log_device_placement=True):
    """
    Allow dynamic memory growth (by default, tensorflow allocates all gpu memory).
    This sometimes fixes the
    <<Error : Failed to get convolution algorithm.
    This is probably because cuDNN failed to initialize,
    so try looking to see if a warning log message was printed above>>.
    May hurt performance slightly (see https://www.tensorflow.org/guide/gpu).
    Usage: Run before any other code.
    :param log_device_placement: set True to log device placement (on which device the operation ran)
    :return: None
    """
    from tensorflow.compat.v1.keras.backend import set_session
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    config.log_device_placement = log_device_placement
    sess = tf.compat.v1.Session(config=config)
    set_session(sess)
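A minimal usage sketch, assuming the function above has been defined in the notebook: call it once at the very top, before any model or session is created. On TensorFlow 2.x a roughly equivalent setting is also available through the tf.config API:

# Call before building any models or creating any sessions.
allow_gpu_memory_growth(log_device_placement=False)

# Roughly equivalent on TensorFlow 2.x: enable memory growth per GPU
# before the GPUs have been initialized.
import tensorflow as tf
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)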
Upvotes: 0
Reputation: 27042
I had this problem several times, and every time it was caused by a dirty checkpoint file that the Saver was trying to restore. The only solution was to delete the last model checkpoint file and restart from the previous one (also removing the line referring to the last checkpoint in the checkpoint.txt file).
This probably happens when something goes wrong while the model is being saved (the saver process dies, something modifies the file while it is still being written, ...).
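A rough sketch of that cleanup, assuming a standard TF1-style checkpoint directory. The directory name and checkpoint prefixes below are hypothetical placeholders; note that TensorFlow usually names the index file simply "checkpoint" rather than "checkpoint.txt":

import os
import glob

# Hypothetical paths; adjust to your own checkpoint directory and prefixes.
ckpt_dir = "./checkpoints"
bad_prefix = "model.ckpt-1000"   # the (possibly corrupted) latest checkpoint
good_prefix = "model.ckpt-900"   # the previous, known-good checkpoint

# 1) Delete the files belonging to the corrupted checkpoint
#    (.index, .meta, .data-00000-of-00001, ...).
for f in glob.glob(os.path.join(ckpt_dir, bad_prefix + ".*")):
    os.remove(f)

# 2) Rewrite the checkpoint index file so it no longer references the
#    deleted checkpoint and points at the previous one instead.
index_path = os.path.join(ckpt_dir, "checkpoint")
with open(index_path) as fh:
    lines = [l for l in fh if bad_prefix not in l]
lines = ['model_checkpoint_path: "%s"\n' % good_prefix] + \
        [l for l in lines if not l.startswith("model_checkpoint_path")]
with open(index_path, "w") as fh:
    fh.writelines(lines)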
Upvotes: 1