Reputation: 147
I am using the TensorFlow Object Detection API to train a two-class model. Training starts at step 0 and logs every 100 steps (100, 200, 300, 400, 500, ...). When it reaches step 1000, it runs an evaluation and I can view the results in TensorBoard. After step 1000, however, a checkpoint is saved at every single step (1001, 1002, 1003, ...) and an evaluation also runs at every single step. Why does this happen?
TensorFlow version: nvidia-tensorflow 1.15
Training is based on: https://colab.research.google.com/github/google-coral/tutorials/blob/master/retrain_ssdlite_mobiledet_qat_tf1.ipynb
Upvotes: 1
Views: 948
Reputation: 147
I found a fix, but I don't understand it in depth.
In the Python file "run_config.py", located at "python3.6/site-packages/tensorflow_estimator/python/estimator/run_config.py", there is a variable named "save_checkpoints_steps" that is assigned the value "_USE_DEFAULT". After I changed it to 1000, the problem went away and checkpoints were saved only every 1000 steps.
I still don't know why "_USE_DEFAULT" caused a checkpoint to be saved at every single step.
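The same setting can probably be applied without editing the installed package, by passing it to tf.estimator.RunConfig in the training script. This is only a minimal sketch: it assumes TensorFlow 1.15 (where tf.estimator is available) and that your training script builds its own RunConfig. MODEL_DIR is a placeholder path, not something from the original setup.

```python
import tensorflow as tf  # assumes TensorFlow 1.15, where tf.estimator exists

# Placeholder checkpoint directory (an assumption); substitute your own model_dir.
MODEL_DIR = "/tmp/od_training_demo"

# Passing save_checkpoints_steps explicitly replaces the _USE_DEFAULT sentinel.
# When only the step-based setting is given, RunConfig sets
# save_checkpoints_secs to None, so checkpoints are driven by step count alone.
config = tf.estimator.RunConfig(
    model_dir=MODEL_DIR,
    save_checkpoints_steps=1000,
)

print(config.save_checkpoints_steps)
print(config.save_checkpoints_secs)
```

The resulting config object would then be handed to the Estimator in place of the one the script creates by default. This avoids modifying files under site-packages, which would otherwise be overwritten on reinstall.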
Upvotes: 1
Reputation: 306
I'm not sure why weights are being saved at every step after step 1000.
If you are using the slim-based training code (trainer.py) and want to change how many .ckpt files are kept, you must change line 370 to:
    saver = tf.train.Saver(
        keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
        max_to_keep=10)
This keeps the 10 most recent .ckpt files.
If you want to change how often a .ckpt is written, you must add the following argument to the slim.learning.train call (line 397):
    save_interval_secs=X
where X is the interval between checkpoints, in seconds.
Upvotes: 0