Reputation: 588
I'm running into an issue where my model training slows down dramatically: across repeated runs, the time per step climbs from roughly 150 µs to roughly 700 µs.
Here is what happens:
Epoch 00001: val_loss did not improve from 0.03340
Run 27 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876
Epoch 00001: val_loss did not improve from 0.03340
Run 28 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863
Epoch 00001: val_loss did not improve from 0.03340
Run 29 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856
Epoch 00001: val_loss did not improve from 0.03340
Run 30 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882
Epoch 00001: val_loss did not improve from 0.03340
Run 31 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861
Epoch 00001: val_loss did not improve from 0.03340
Run 32 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868
Epoch 00001: val_loss did not improve from 0.03340
Run 33 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873
Epoch 00001: val_loss did not improve from 0.03340
Run 34 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849
Epoch 00001: val_loss did not improve from 0.03340
Run 35 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840
Epoch 00001: val_loss did not improve from 0.03340
Run 36 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820
My GPU usage doesn't decrease (it actually increases). My CPU usage, CPU clocks, and GPU clocks (core and memory) all remain about the same, and my RAM usage is also roughly constant. The only strange part is that my overall power draw drops (measured in percent).
I've read somewhere that this can be caused by the beta_1 parameter of the Adam optimizer, and that setting it to 0.99 should fix the issue, but the issue persists.
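For reference, this is roughly how that suggestion would be applied (a minimal sketch only; the model below is a placeholder matching the shapes in the log above, not my actual network):

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Placeholder model with the (4410 -> 12) shapes from the log above.
model = Sequential([Dense(12, activation="sigmoid", input_shape=(4410,))])

# Adam's beta_1 defaults to 0.9; the suggestion was to raise it to 0.99.
model.compile(optimizer=Adam(beta_1=0.99),
              loss="binary_crossentropy",
              metrics=["binary_accuracy", "accuracy"])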
Is there any other reason why this would be happening? It looks like something on the computation side, as there are no indicators of hardware/driver issues.
Upvotes: 0
Views: 2354
Reputation: 588
Just in case anyone else has this issue, I'll compile a list of things that might help:

1) K.clear_session()
(make sure you import it first: from keras import backend as K)

2) Let TensorFlow allocate GPU memory on demand instead of all at once:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
(and run your training with this session)

3) del VARIABLE might help for large objects you no longer need.

Worst case scenario, you'll have to load smaller data chunks or decrease the model size. If anyone has any other ideas for what might solve a problem like this, feel free to comment and I'll edit this answer.
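For context, here is a rough sketch of how these pieces could fit together in a run loop like the one in the question. This is only a sketch: the model, data, and shapes are placeholders chosen to match the log, and it assumes the TF 1.x-style session API via tf.compat.v1 with the standalone Keras backend.

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def make_session():
    # Let TensorFlow grow GPU memory on demand instead of grabbing it all.
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.compat.v1.Session(config=config)

def build_model():
    # Placeholder model matching the (4410 -> 12) shapes from the log.
    model = Sequential([Dense(12, activation="sigmoid", input_shape=(4410,))])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["binary_accuracy", "accuracy"])
    return model

for run in range(40):
    # Start every run with a fresh graph/session so old ops don't pile up.
    K.clear_session()
    K.set_session(make_session())

    # Placeholder data; replace with the real per-run loading code.
    x = np.random.rand(15000, 4410).astype("float32")
    y = (np.random.rand(15000, 12) > 0.5).astype("float32")

    model = build_model()
    model.fit(x, y, epochs=1, validation_split=0.2)

    # Drop big references so Python/TensorFlow can actually free the memory.
    del model, x, y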
Upvotes: 2