Kernel dies and restarts whenever I train a model

Here's the code:

# import libraries
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# import dataset
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator()

test_datagen = ImageDataGenerator()

training_set = train_datagen.flow_from_directory(
                                            'data/spectrogramme/ensemble_de_formation',
                                            target_size = (64, 64),
                                            batch_size = 128,
                                            class_mode = 'binary')

test_set = test_datagen.flow_from_directory('data/spectrogramme/ensemble_de_test',
                                            target_size = (64, 64),
                                            batch_size = 128,
                                            class_mode = 'binary')

# initializing
reseau = Sequential()

# 1. convolution
reseau.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(32, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(64, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(64, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))

# 2. flattening
reseau.add(Flatten())

# 3. fully connected
from keras.layers import Dropout
reseau.add(Dense(units = 64, activation = 'relu'))
reseau.add(Dropout(0.1))
reseau.add(Dense(units = 128, activation = 'relu'))
reseau.add(Dropout(0.05))
reseau.add(Dense(units = 256, activation = 'relu'))
reseau.add(Dropout(0.03))
reseau.add(Dense(units = 1, activation = 'sigmoid'))

# 4. compile
reseau.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# 5. fit
reseau.fit_generator(training_set, steps_per_epoch = 8000, epochs = 1,
                     validation_data = test_set, validation_steps = 2000)

This should prove that I have tensorflow-gpu with CUDA and cuDNN installed (screenshot).

I don't know what to do. I have reinstalled CUDA and cuDNN multiple times.

HOWEVER, if I uninstall tensorflow-gpu, the program runs flawlessly... with the exception of needing 5000 seconds per epoch... I'd like to avoid that

FYI, this is all happening on Windows

Any help is appreciated.

Upvotes: 13

Answers (13)

thankit28

Reputation: 36

I had a similar problem while trying to run the simplest neural network with the keras library.

model = Sequential()

model.add(Input(shape=(vocab_size, )))
model.add(Dense(embed_size, activation="linear"))
model.add(Dense(vocab_size, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")

I was running on an Apple M2, but the kernel kept dying and a restart notification would pop up in Jupyter.

It used to fail every time I ran:

model.fit(X, y, epochs=1000)

pip install --upgrade tensorflow

Upgrading with pip worked, as it updated the existing tensorflow, keras and other dependent libraries. Hope it helps someone!

Upvotes: 0

MCPMH

Reputation: 403

Please check your cuDNN installation.

I had the same problem, and it was solved after installing the correct cuDNN version.

Upvotes: 0

Jay Patel

Reputation: 31

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

This solution was provided by Krishna Kankipati on Kaggle.

Upvotes: 1

Maruata

Reputation: 41

The CUDA, cuDNN, TensorFlow and Python version compatibility table is at https://www.tensorflow.org/install/source#gpu, but I installed the following versions and it works perfectly.

The problem can be solved by:

Install the latest Anaconda Navigator.
Install Python v3.8.x
Install TensorFlow v2.10.0
Install CUDA v11.8
Install cuDNN v8.6.x
Paste the zlibwapi.dll file into the CUDA bin folder. The file can be downloaded from https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows
Follow the instructions given at the above link.

This is working for me. Earlier I had not placed zlibwapi.dll in the CUDA/bin folder, and that was the reason I faced the same problem.
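To compare an installation against the compatibility table, it can help to ask TensorFlow which CUDA and cuDNN versions it was built for. The following is a minimal sketch, assuming TensorFlow 2.3 or later (where `tf.sysconfig.get_build_info()` is available). The helper names `summarize_build` and `main` are made up for illustration, and the dictionary keys may differ between builds, so each lookup falls back to "unknown".

```python
# Sketch: report the CUDA/cuDNN versions this TensorFlow build expects.
# Key names can vary between builds, so every lookup uses .get().

def summarize_build(info, gpus):
    """Format build info and the detected GPU list as readable lines."""
    return [
        "CUDA build: {}".format(info.get("is_cuda_build", "unknown")),
        "CUDA version expected: {}".format(info.get("cuda_version", "unknown")),
        "cuDNN version expected: {}".format(info.get("cudnn_version", "unknown")),
        "GPUs visible: {}".format(len(gpus)),
    ]


def main():
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
        return
    info = tf.sysconfig.get_build_info()
    gpus = tf.config.list_physical_devices("GPU")
    for line in summarize_build(info, gpus):
        print(line)


if __name__ == "__main__":
    main()
```

If "GPUs visible" is 0 on a machine that has a GPU, the installed CUDA or cuDNN libraries probably do not match what the build expects.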

I hope this helps.

Upvotes: 0

SavvY

Reputation: 101

I had a similar problem because my CUDA and cuDNN versions were much higher than those listed in the compatibility chart. The Dense layers would work fine for me, but using Conv2D/Conv3D would kill my kernel.

Solution

Make sure you have the zlib file copied into your CUDA\v11.x\bin directory. I had issues downloading it from NVIDIA's website but found a way around that.

The NVIDIA website refers to zlibwapi.dll. I was able to locate this file in "C:\Program Files\Microsoft Office\root\Office16\ODBC Drivers\Salesforce\lib" (I installed Microsoft 365 x64 on Windows 11) and copied it into "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin". I was able to run TensorFlow 2.8.0 after that.

Thanks to srikkanth_kn's solution, I was able to find the zlibwapi.dll file (in MS Office) and paste it into CUDA's bin folder (make sure CUDA's bin folder is in your PATH). After that everything worked fine. Hope this helps you and saves you time.
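A quick way to confirm that the DLL ended up somewhere Windows can find it is to scan the PATH directories for it. This is only a sketch using the standard library. The helper name `find_on_path` is made up for illustration, and it checks PATH entries only, not every location the Windows loader may search.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every PATH directory that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    locations = find_on_path("zlibwapi.dll")
    if locations:
        print("zlibwapi.dll found in:", *locations, sep="\n  ")
    else:
        print("zlibwapi.dll is not in any PATH directory")
```

An empty result suggests that either the file was not copied or CUDA's bin folder is missing from PATH.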

Upvotes: 3

Igor Rossi Fermo

Reputation: 1

I had exactly the same problem. I tried every solution mentioned in this post and none of them worked. After many attempts, I found that the problem was the CUDA installation. I followed the NVIDIA tutorial, but at the step where you copy the 3 files from the cuDNN directory (as the tutorial says), you should instead copy the 3 folders and paste them over (replacing) the ones in the NVIDIA directory. After this, my GPU works without problems.

Upvotes: 0

MeriDK

Reputation: 1

I had the same issue. In the end, running the code as a .py file helped me see that the problem was with cuDNN: not all of its files were installed.

Upvotes: 0

Hisan

Reputation: 2655

A very cumbersome issue with tensorflow-gpu. It took me days to find the best working solution.

What seems to be the problem:

I know you might have installed cuDNN and CUDA (just like me) after watching YouTube videos or reading documentation online. But CUDA and cuDNN are very strict about version clashes, so there may be a mismatch between your TensorFlow, CUDA and cuDNN versions.

What's the solution:

The tensorflow build automatically selected by Anaconda on Windows 10 during the installation of tensorflow-gpu 2.3 seems to be faulty. Please find a workaround here (consider upvoting the GitHub answer if you have a GitHub account).

Python 3.7: conda install tensorflow-gpu=2.3 tensorflow=2.3=mkl_py37h936c3e2_0

Python 3.8: conda install tensorflow-gpu=2.3 tensorflow=2.3=mkl_py38h1fcfbd6_0

These commands automatically download CUDA and cuDNN along with tensorflow-gpu. After trying this solution, I was able to fit() TensorFlow models and also got the speed-up from the GPU.

A word of advice:

If you are working with machine learning / data science, I would strongly advise you to switch to Anaconda instead of pip. This lets you create virtual environments and integrates easily with Jupyter notebooks. You can create a separate virtual environment for machine learning tasks, as they often require upgrading or downgrading libraries. With virtual environments, this won't affect your other packages outside the environment.

Upvotes: 4

SidK

Reputation: 1214

I had the same problem. In my case, the notebook kernel was crashing as soon as I ran the block with all the model.add() code.

I went to Jupyter Home and found that another notebook, which I had used earlier to train a model on the GPU, was still running even though I had closed its browser tab. As suggested by @Ian Henry, I shut down the ones I wasn't using, restarted the kernel and ran all the blocks again, and this time it worked perfectly fine.

Note that notebooks keep running in the background even when you close the browser. You can verify this by checking the icon for the respective notebook, which should be green if running and grey if not. To shut down a running notebook, go to the Running tab and click the Shutdown button next to the notebook name.
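Besides the Running tab, the processes currently holding the GPU can be listed from Python by calling nvidia-smi. The sketch below assumes nvidia-smi is on PATH and supports the `--query-compute-apps` option with the `pid`, `process_name` and `used_memory` fields. The helper names are made up for illustration, and on some Windows setups the memory column may be reported as [N/A].

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            # Split from both ends so commas inside the process name survive.
            pid, rest = line.split(",", 1)
            name, memory = rest.rsplit(",", 1)
        except ValueError:
            continue  # skip lines that do not have three fields
        rows.append({"pid": pid.strip(), "name": name.strip(),
                     "memory": memory.strip()})
    return rows


def list_gpu_processes():
    """Return the compute processes nvidia-smi reports, or [] on failure."""
    command = ["nvidia-smi",
               "--query-compute-apps=pid,process_name,used_memory",
               "--format=csv,noheader"]
    try:
        result = subprocess.run(command, capture_output=True,
                                text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(proc["pid"], proc["name"], proc["memory"])
```

A leftover python process in this list, from a notebook you thought was closed, is a likely candidate for the kernel that is still holding GPU memory.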

Upvotes: 1

Amirkhm

Reputation: 1096

In my case, I needed to install Keras through conda:

conda install keras

Upvotes: 1

Zhanwen Chen

Reputation: 1463

I had the same issue running model.fit() in Jupyter Notebook. A good starting point for debugging is always to download the notebook as a .py file and run it. This way you get all errors and warnings.
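The usual way to get the .py file is Jupyter's "Download as" menu or `jupyter nbconvert --to script`. If neither is handy, the code cells can also be pulled out with the standard library, since a notebook is a JSON file. This is a rough sketch: the function name `notebook_to_script` and the file names in the usage comment are placeholders, and commenting out lines that start with % or ! is only a heuristic for IPython magics and shell escapes.

```python
import json


def notebook_to_script(notebook_path, script_path):
    """Write the code cells of a notebook to a plain .py file.

    Returns the number of code cells written.
    """
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Magics and shell escapes are not valid Python: comment them out.
            if line.lstrip().startswith(("%", "!")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))

    with open(script_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(chunks) + "\n")
    return len(chunks)


# Example with placeholder file names:
# notebook_to_script("train.ipynb", "train.py")
```

Running the resulting file with `python train.py` from a terminal prints the CUDA and cuDNN errors that the notebook hides behind the "kernel died" message.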

In terms of a solution, I doubt that this will solve most cases, but I installed cuDNN 7.2(.1) via .deb files, reinstalled tensorflow-gpu, and it worked. In the end, it wasn't a version issue with the driver (I had CUDA 9.0 and 384.xx, which was correct), but one with cuDNN.

Upvotes: 0

Kathiravan Natarajan

Reputation: 3508

The problem is with Jupyter Notebook. I have the same problem with it. If you run the same code in a CPU-based environment, or from a terminal with the GPU, it will work for sure.

Upvotes: 0

Ian Henry

Reputation: 21

If you are using Jupyter, check for any running notebooks, as I've found that they hang on to the GPU memory even when they are not actively running anything.

In Jupyter, shut down any running notebooks you are not using.

Upvotes: 0
