Kasra

Reputation: 2189

Keras ModelCheckpoint overwrites previous best checkpoint when training resumed

I am using the ModelCheckpoint callback from Keras:

    checkpointer = ModelCheckpoint(filepath= model_filepath,
                                   verbose=1,
                                   save_best_only=True)

I cannot train my model in one session, so I have to save and load it several times and resume training to keep improving it. However, when I load the model and resume training, the callback's tracked val_loss starts back at inf, so at the end of the first epoch it drops to some value (say 0.23) and the previously saved model is always overwritten. But the best val_loss from my previous training session was 0.19 (0.19 < 0.23, so the previous model is still the best and should not be overwritten).

How can I tell Keras to take the best val_loss from the previous training session into account and stop overwriting the better checkpoint?
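In other words, this is roughly what I want to happen when I resume (a minimal sketch; seeding the callback's internal `best` attribute is an assumption about the Keras implementation, not a documented API):

    from tensorflow.keras.models import load_model
    from tensorflow.keras.callbacks import ModelCheckpoint

    model = load_model(model_filepath)

    checkpointer = ModelCheckpoint(filepath=model_filepath,
                                   verbose=1,
                                   save_best_only=True)

    # A freshly constructed ModelCheckpoint starts tracking from best = inf,
    # so seed it with the best val_loss of the previous session (0.19 here)
    # to keep a worse first epoch from overwriting the saved model.
    checkpointer.best = 0.19

    # model.fit(..., callbacks=[checkpointer])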

Upvotes: 0

Views: 1442

Answers (2)

Abdul Hasib Uddin

Reputation: 11

    import pandas as pd
    from tensorflow.keras.callbacks import ModelCheckpoint, LambdaCallback, CSVLogger
    from tensorflow.keras.models import load_model

    work_dir = "drive/My Drive/Training Records/"
    NUM_EPOCHS = 300
    checkpointer_name = "model_checkpoint.hdf5"
    log_name = "log_" + checkpointer_name[:-5] + ".log"

Step 1:

    # Saves the latest model after every epoch (used to resume training).
    checkpointer = ModelCheckpoint(filepath=work_dir + checkpointer_name,
                                   monitor='val_loss',
                                   mode='auto',
                                   verbose=0,
                                   save_best_only=False)

    # Saves the best model so far (only used on the very first run).
    checkpointer_best = ModelCheckpoint(filepath=work_dir + "best_" + checkpointer_name,
                                        monitor='val_loss',
                                        mode='auto',
                                        verbose=1,
                                        save_best_only=True)

Step 2:

    # Compare the current epoch against the best values recorded in the CSV log
    # from all previous runs, and save the model only if it improves on them.
    def checkBestPerformance(epoch, logs):
        log_data = pd.read_csv(work_dir + log_name, sep=',',
                               usecols=['val_loss', 'val_accuracy'], engine='python')
        min_val_loss = min(log_data.val_loss.values)
        max_val_acc = max(log_data.val_accuracy.values)

        current_val_acc = logs['val_accuracy']
        current_val_loss = logs['val_loss']

        save_filepath = work_dir + "best_" + checkpointer_name
        if current_val_loss < min_val_loss:
            model.save(filepath=save_filepath)
            print("\nval_loss decreased from", min_val_loss, "to", current_val_loss)
        elif (current_val_loss == min_val_loss) and (current_val_acc > max_val_acc):
            model.save(filepath=save_filepath)
            print("\nval_accuracy increased from", max_val_acc, "to", current_val_acc)
        else:
            print("\nPerformance did not improve from min_val_loss =", min_val_loss,
                  ", max_val_acc =", max_val_acc)

Step 3:

    epochs_completed = 0
    csv_logger = CSVLogger(work_dir + log_name, separator=',', append=True)

    # Default callbacks for a fresh run: the built-in best-only checkpointer is
    # safe here because there is no earlier history to respect yet.
    list_callbacks = [checkpointer, checkpointer_best, csv_logger]

    try:
        log_data = pd.read_csv(work_dir + log_name, sep=',', usecols=['epoch'], engine='python')
        epochs_completed = log_data.shape[0]

        if epochs_completed > 0:
            # Resuming: reload the latest model and swap the best-only checkpointer
            # for the LambdaCallback, which knows the previous best from the log.
            model = load_model(work_dir + checkpointer_name)
            list_callbacks = [checkpointer,
                              LambdaCallback(on_epoch_end=checkBestPerformance),
                              csv_logger]
            print("epochs_completed =", epochs_completed)
    except Exception:
        # No readable log yet, so this is treated as the first run.
        pass

Step 4:

print("Previously completed epochs =", epochs_completed, "\n")

history = model.fit(final_train_imageset, final_train_label, 
                shuffle=True, 
                batch_size = BATCH_SIZE, 
                epochs = NUM_EPOCHS - epochs_completed, 
                validation_split = 0.1,
                callbacks=list_callbacks
                )

Upvotes: 1

Patrick Na

Reputation: 119

Since the callback is not designed for this use case, I would not call it wrong behavior.

I would suggest changing the filepath parameter of the callback whenever you resume training; that way you at least do not lose the previous best checkpoint.
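For example, a minimal sketch (the timestamped filename is just one illustrative way to make the path unique per run):

    from datetime import datetime
    from tensorflow.keras.callbacks import ModelCheckpoint

    # Give each resumed run its own checkpoint file so an earlier best
    # checkpoint can never be overwritten by a later, worse run.
    run_id = datetime.now().strftime("%Y%m%d-%H%M%S")
    checkpointer = ModelCheckpoint(filepath="model_best_" + run_id + ".hdf5",
                                   verbose=1,
                                   save_best_only=True)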

Upvotes: 1
