Training and validation loss increases after 10 epochs

Question

I am training an image captioning model. This model is consist of two other models, a BERT and an Xception model. I train both of these two models in parallel. The model training accuracy seems fine till 10 epochs then the loss starts increasing. The code and parameters of this model are as follows.

num_epochs = 20  # In practice, train for at least 30 epochs
batch_size = 1

vision_encoder = create_vision_encoder(num_projection_layers=1, projection_dims=256, dropout_rate=0.1)
text_encoder = create_text_encoder(num_projection_layers=1, projection_dims=256, dropout_rate=0.1)
dual_encoder = DualEncoder(text_encoder, vision_encoder, temperature=0.05)
dual_encoder.compile(optimizer=tfa.optimizers.AdamW(learning_rate=0.001, weight_decay=0.001), #run_eagerly=True)

from tensorflow.keras.callbacks import LearningRateScheduler
import math
def step_decay(epoch):
   initial_lrate = 0.001
   drop = 0.005
   epochs_drop = 10.0
   lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
   return lrate
lrate = LearningRateScheduler(step_decay)
callbacks_list = [lrate]

print(f"Number of GPUs: {len(tf.config.list_physical_devices('GPU'))}")
print(f"Number of examples (caption-image pairs): {train_example_count}")
print(f"Batch size: {batch_size}")
print(f"Steps per epoch: {int(np.ceil(train_example_count / batch_size))}")
train_dataset = get_dataset(os.path.join(tfrecords_dir, "train-*.tfrecord"), batch_size)
valid_dataset = get_dataset(os.path.join(tfrecords_dir, "valid-*.tfrecord"), batch_size)
# Create a learning rate scheduler callback.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3)
# Create an early stopping callback.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
history = dual_encoder.fit(
  train_dataset,
  epochs=num_epochs,
  validation_data=valid_dataset,
  callbacks=[reduce_lr, early_stopping, callbacks_list],
  )
print("Training completed. Saving vision and text encoders...")
vision_encoder.save("/content/drive/MyDrive/vision_encoder")
text_encoder.save("/content/drive/MyDrive/text_encoder")
print("Models are saved.")

model epochs

Number of GPUs: 1
Number of examples (caption-image pairs): 3500
Batch size: 1
Steps per epoch: 3500
Epoch 1/20
3500/3500 [==============================] - 217s 62ms/step - loss: 5.1028e-04 - val_loss: 1.9643e-04 - lr: 0.0010
Epoch 2/20
3500/3500 [==============================] - 218s 62ms/step - loss: 8.8274e-05 - val_loss: 3.3228e-05 - lr: 0.0010
Epoch 3/20
3500/3500 [==============================] - 220s 63ms/step - loss: 0.3582 - val_loss: 4.2012e-04 - lr: 0.0010
Epoch 4/20
3500/3500 [==============================] - 216s 62ms/step - loss: 9.6259e-04 - val_loss: 3.7130e-05 - lr: 0.0010
Epoch 5/20
3500/3500 [==============================] - 213s 61ms/step - loss: 1.7488e-05 - val_loss: 6.3365e-06 - lr: 2.0000e-04
Epoch 6/20
3500/3500 [==============================] - 208s 59ms/step - loss: 2.9985e-06 - val_loss: 1.0982e-06 - lr: 0.0010
Epoch 7/20
3500/3500 [==============================] - 207s 59ms/step - loss: 1.0761 - val_loss: 0.0212 - lr: 0.0010
Epoch 8/20
3500/3500 [==============================] - 211s 60ms/step - loss: 0.0062 - val_loss: 4.6654e-05 - lr: 2.0000e-04
Epoch 9/20
3499/3500 [============================>.] - ETA: 0s - loss: 2.2375e-05Epoch 10/20
3500/3500 [==============================] - 210s 60ms/step - loss: 234.2512 - val_loss: 309.9704 - lr: 5.0000e-06
Epoch 11/20
3500/3500 [==============================] - 211s 60ms/step - loss: 310.0370 - val_loss: 309.7400 - lr: 1.0000e-06
Training completed. Saving vision and text encoders...
WARNING:absl:Found untraced functions such as restored_function_body, restored_function_body, restored_function_body, restored_function_body, restored_function_body while saving (showing 5 of 124). These functions will not be directly callable after loading.
Models are saved.

Training and validation loss increases after 10 epochs

Answers (1)

Related Questions