Piotr Januszewski

Reputation: 115

How to store the best model checkpoints, not only the newest 5, in the TensorFlow Object Detection API?

I'm training MobileNet on the WIDER FACE dataset and I've run into a problem I couldn't solve. The TF Object Detection API stores only the last 5 checkpoints in the train dir, but what I would like to do is save the best models relative to the mAP metric (or at least keep many more models in the train dir before deletion). For example, today I looked at TensorBoard after another night of training and saw that the overnight model has overfitted, and I can't restore the best checkpoint because it has already been deleted.

EDIT: I just use the TensorFlow Object Detection API; by default it saves the last 5 checkpoints in the train dir I point it to. I'm looking for a configuration parameter, or anything else, that will change this behavior.

Does anyone have a fix in code, a config parameter to set, or a workaround for this? It seems like I'm missing something; it should be obvious that what's actually important is the best model, not the newest one (which can overfit).

Thanks!

Upvotes: 9

Views: 13279

Answers (5)

Vignesh Kathirkamar

Reputation: 147

For saving more checkpoints, you can write a simple Python script that periodically copies the checkpoints to a separate directory.

import os
import shutil
import time

training_dir = '/home/vignesh/training'          # path of your train directory
archive_dir = '/home/vignesh/training/archive'   # directory where you want to keep your checkpoints
os.makedirs(archive_dir, exist_ok=True)

while True:
    # Copy every checkpoint file (model.ckpt-*) that has not been archived yet
    archived = set(os.listdir(archive_dir))
    for name in os.listdir(training_dir):
        if name.split('.')[0] == 'model' and name not in archived:
            shutil.copy2(os.path.join(training_dir, name), archive_dir)
    time.sleep(600)  # run every 600 seconds, modify it for your need

Upvotes: 0

Ravi Sharma

Reputation: 60

You can follow this PR. With it, your best checkpoint is saved within your checkpoint directory, in a sub-directory named best.

You just need to integrate best_saver() (and its method call in _run_checkpoint_once()) into ../object_detection/eval_util.py.

Additionally, it will also create a JSON file with all the evaluation metrics.

Upvotes: 1

David de la Iglesia

Reputation: 2534

You can modify the arguments passed to tf.train.Saver (by hardcoding them in your fork, or by opening a pull request that adds the options to the protos) in:

https://github.com/tensorflow/models/blob/master/research/object_detection/legacy/trainer.py#L376-L377

You will probably want to set:

  • max_to_keep: Maximum number of recent checkpoints to keep. Defaults to 5.
  • keep_checkpoint_every_n_hours: How often to keep checkpoints. Defaults to 10,000 hours.
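For example, a minimal sketch (assuming the TF 1.x API used by trainer.py) of a widened retention policy at that call site:

    saver = tf.train.Saver(
        max_to_keep=50,                     # keep many more recent checkpoints
        keep_checkpoint_every_n_hours=2.0)  # plus one permanent checkpoint every 2 h

With keep_checkpoint_every_n_hours set, one checkpoint from each such interval is kept permanently, so an overnight run can no longer delete all of its history.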

Upvotes: 6

Yongman Hong

Reputation: 45

You can change the config.

in run_config.py

class RunConfig(object):
  """This class specifies the configurations for an `Estimator` run."""

  def __init__(self,
           model_dir=None,
           tf_random_seed=None,
           save_summary_steps=100,
           save_checkpoints_steps=_USE_DEFAULT,
           save_checkpoints_secs=_USE_DEFAULT,
           session_config=None,
           keep_checkpoint_max=10,
           keep_checkpoint_every_n_hours=10000,
           log_step_count_steps=100,
           train_distribute=None,
           device_fn=None,
           protocol=None,
           eval_distribute=None,
           experimental_distribute=None):
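As a sketch (assuming the TF 1.x Estimator API; model_fn stands in for your own model function), you would pass such a config when constructing the Estimator:

    config = tf.estimator.RunConfig(
        keep_checkpoint_max=50,           # keep the 50 most recent checkpoints
        keep_checkpoint_every_n_hours=2)  # and one permanent checkpoint every 2 h
    estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)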

Upvotes: 3

efont

Reputation: 246

You may be interested in this TF GitHub thread, which tackles the newest/best checkpoint issue. A user developed his own wrapper, chekmate, around tf.Saver to keep track of the best checkpoints.
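The idea behind such a wrapper can be sketched in plain Python (BestCheckpointKeeper and its methods are hypothetical names for illustration, not the wrapper's actual API): track a score such as mAP per checkpoint and evict the worst ones once a budget is exceeded.

```python
import heapq


class BestCheckpointKeeper:
    """Keep only the n best checkpoints by score (e.g. mAP).

    Hypothetical sketch of a best-checkpoint policy; file deletion
    itself is left to the caller.
    """

    def __init__(self, n_best=5):
        self.n_best = n_best
        self._heap = []  # min-heap of (score, path); worst checkpoint on top

    def update(self, path, score):
        """Record a new checkpoint; return paths that may now be deleted."""
        heapq.heappush(self._heap, (score, path))
        evicted = []
        while len(self._heap) > self.n_best:
            _, worst_path = heapq.heappop(self._heap)
            evicted.append(worst_path)
        return evicted

    def best(self):
        """Return the path of the best-scoring checkpoint kept so far."""
        return max(self._heap)[1] if self._heap else None
```

Because update returns the paths that fell out of the top n, disk usage stays bounded while the best-scoring checkpoints are guaranteed to survive, which is exactly what the default keep-newest-5 policy fails to do.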

Upvotes: 1
