Rocketq

Reputation: 5791

Keras with tensorflow-gpu totally freezes PC

I have a fairly simple LSTM network. After 1-2 epochs my PC freezes completely; I can't even move the mouse:

Layer (type)                 Output Shape              Param #   
=================================================================
lstm_4 (LSTM)                (None, 128)               116224    
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 98)                12642     
=================================================================
Total params: 128,866
Trainable params: 128,866
Non-trainable params: 0

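    # The same problem occurs with two LSTM layers, dropout and the Adam optimizer.
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense
    from keras.optimizers import RMSprop

    SEQUENCE_LENGTH = 3  # time steps per input sequence
    # `chars` is the character vocabulary from the full code; len(chars) == 98

    model = Sequential()
    model.add(LSTM(128, input_shape=(SEQUENCE_LENGTH, len(chars))))
    # model.add(Dropout(0.15))
    # model.add(LSTM(128))
    model.add(Dropout(0.10))
    model.add(Dense(len(chars), activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01), metrics=['accuracy'])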

This is how I train:

history = model.fit(X, y, validation_split=0.20, batch_size=128, epochs=10, shuffle=True,verbose=2).history

The network needs about 5 minutes to finish one epoch. A larger batch size does not make the freeze happen sooner. A more complex model can train for longer, but it reaches almost the same accuracy, about 0.46 (full code here).

I am running the latest Linux Mint, with a GTX 1070 Ti (8 GB) and 32 GB of RAM.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   35C    P8    10W / 180W |    303MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Libraries:

Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
keras-sequential-ascii==0.1.1
keras-tqdm==2.0.1
tensorboard==1.8.0
tensorflow==1.0.1
tensorflow-gpu==1.8.0

I have tried limiting GPU memory usage, but that can't be the problem here, because training uses only about 1 GB of GPU memory:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap at 90% of GPU memory
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
set_session(tf.Session(config=config))

What is wrong here? How can I fix the problem?

Upvotes: 5

Views: 4146

Answers (3)

Eric Schwerzel

Reputation: 21

I had this exact problem. The computer died after about 15 minutes of training. The cause turned out to be a memory module (SIMM) that failed when it got hot. If you have more than one module installed, you can take them out one at a time to see which one is the culprit.

Upvotes: 2

Rocketq

Reputation: 5791

This seems odd to me, but the problem was related to my new AMD CPU, which was released in April 2018. An up-to-date Linux kernel was crucial: following this guide, https://itsfoss.com/upgrade-linux-kernel-ubuntu/, I updated the kernel from 4.13 to 4.17, and now everything works.

UPD: The motherboard was crashing the system as well. I replaced it, and now everything works well.
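
For anyone who wants to check their own machine, here is a minimal sketch that compares the running kernel against the 4.17 version mentioned above. The 4.17 threshold comes only from this answer and is not an official minimum for any CPU; the function names are made up for illustration.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor) from a release string such as '4.13.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(4, 17)):
    """True if the release is at or above `minimum` (assumed threshold: 4.17)."""
    return parse_kernel_release(release) >= minimum


def running_kernel_ok(minimum=(4, 17)):
    """Check the kernel of the current machine (meaningful on Linux only)."""
    return kernel_at_least(platform.release(), minimum)
```

On Linux, `running_kernel_ok()` returning `False` would only suggest that a kernel upgrade is worth trying; it does not prove the kernel is the cause of a freeze.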

Upvotes: 0

Snehal

Reputation: 757

  • First remove the CPU build, tensorflow==1.0.1. Then try installing tensorflow-gpu==1.8.0 by building TensorFlow from source, as mentioned here

or

  • Replace LSTM with CuDNNLSTM while training the model on the GPU. Later, load the trained weights into the same architecture with an LSTM layer to use the model on a CPU. (Make sure to set recurrent_activation='sigmoid' in the LSTM layer when loading the CuDNNLSTM weights!)
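
For the first option, a quick way to confirm that both builds are installed side by side is to scan the `pip freeze` output. The sketch below is only an illustration: the helper names are hypothetical, and the sample text is the library list copied from the question.

```python
def parse_freeze(text):
    """Map lower-cased package name -> version from `pip freeze`-style lines."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a pinned `name==version`.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def tensorflow_conflict(packages):
    """Return (cpu_version, gpu_version) if both builds are present, else None."""
    cpu = packages.get("tensorflow")
    gpu = packages.get("tensorflow-gpu")
    if cpu is not None and gpu is not None:
        return cpu, gpu
    return None


# Library list from the question.
FREEZE = """
Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
keras-sequential-ascii==0.1.1
keras-tqdm==2.0.1
tensorboard==1.8.0
tensorflow==1.0.1
tensorflow-gpu==1.8.0
"""

conflict = tensorflow_conflict(parse_freeze(FREEZE))
if conflict:
    print("Both builds installed: tensorflow==%s and tensorflow-gpu==%s" % conflict)
```

If a conflict is reported, `pip uninstall tensorflow` removes the CPU build; reinstalling tensorflow-gpu afterwards may be needed, since the two packages share files.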

Upvotes: 1
