explodingfilms101

Reputation: 588

Why does Keras training slow down after a while?

I'm running into an issue where my model training slows down dramatically over repeated runs.

Here is what happens:


Epoch 00001: val_loss did not improve from 0.03340
Run 27 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876

Epoch 00001: val_loss did not improve from 0.03340
Run 28 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863

Epoch 00001: val_loss did not improve from 0.03340
Run 29 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856

Epoch 00001: val_loss did not improve from 0.03340
Run 30 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882

Epoch 00001: val_loss did not improve from 0.03340
Run 31 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861

Epoch 00001: val_loss did not improve from 0.03340
Run 32 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868

Epoch 00001: val_loss did not improve from 0.03340
Run 33 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873

Epoch 00001: val_loss did not improve from 0.03340
Run 34 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849

Epoch 00001: val_loss did not improve from 0.03340
Run 35 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840

Epoch 00001: val_loss did not improve from 0.03340
Run 36 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820

My GPU usage doesn't decrease (it actually increases):

(screenshot: GPU usage over time)

My CPU usage, CPU clocks and GPU clocks (core and memory) all remain about the same. My RAM usage also stays roughly constant.

The only strange part is that my overall power draw drops (chart is in percent):

(screenshot: overall power draw, in percent)

I've read somewhere that this can be caused by the beta_1 parameter of the Adam optimizer, and that setting it to 0.99 should fix it, yet the issue persists.
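For reference, this is roughly how that suggested change would look (a minimal sketch assuming the tf.keras Adam optimizer, where beta_1 defaults to 0.9):

from tensorflow.keras.optimizers import Adam

# The suggestion: raise beta_1 from its default of 0.9 to 0.99, then pass
# the optimizer to model.compile(optimizer=opt, ...)
opt = Adam(learning_rate=0.001, beta_1=0.99)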

Is there any other reason why this would be happening? It looks like something on the computation side, as there are no indicators of hardware/driver issues.

Upvotes: 0

Views: 2354

Answers (1)

explodingfilms101

Reputation: 588

In case anyone else runs into this issue, I'll compile a list of things that might help:

  1. Try setting beta_1 to 0.99 in the Adam optimizer.
  2. If you're calling model.fit() multiple times, adding K.clear_session() after each fit() might also help (make sure you import it first: from keras import backend as K). See the sketch after this list.
  3. Add this after your imports (if using TensorFlow 2.x):

     config = tf.compat.v1.ConfigProto()
     config.gpu_options.allow_growth = True
     sess = tf.compat.v1.Session(config=config)

  4. If you have an open file (after calling open()), make sure you close it (or better yet, use a with block).
  5. Make sure nothing else is running in the background that could use the GPU (games, heavy websites, etc.).
  6. Check your pagefile usage. The pagefile is significantly slower than RAM, so you may simply be running out of memory; doing del VARIABLE on large objects you no longer need might help. Worst case, you'll have to load smaller data chunks or decrease the model size.
  7. Try setting the GPU to maximum performance in the NVIDIA Control Panel.
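
To make items 2 and 3 concrete, here is a minimal sketch of a repeated-training loop with both applied; the tiny model and the random data are stand-ins for whatever you're actually training:

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras import layers, models

# Item 3: let TensorFlow grow GPU memory as needed instead of grabbing it all up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

# Stand-in data with the same shapes as in the question.
x = np.random.rand(1000, 4410).astype("float32")
y = np.random.randint(0, 2, size=(1000, 12)).astype("float32")

for run in range(5):
    model = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(4410,)),
        layers.Dense(12, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x, y, epochs=1, batch_size=128, verbose=0)

    # Item 2: clear graph/session state so later runs don't get slower.
    K.clear_session()

If the slowdown disappears with clear_session() in place, accumulated graph state from the repeated fit() calls was the likely culprit.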

If anyone has any other ideas for what might solve a problem like this, feel free to comment and I'll edit this answer.

Upvotes: 2
