Reputation: 1013
I've been running a training job for the last 3 hours on a GPU-powered cloud machine with the following command:
python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
Since starting it, the log has been printing lines like these:
INFO:tensorflow:global step 14455: loss = 0.5896 (0.775 sec/step)
I1001 19:27:43.575182 140054916601600 tf_logging.py:116] global step 14455: loss = 0.5896 (0.775 sec/step)
How do I know how many steps remain, or how many steps there are in total?
Upvotes: 2
Views: 3231
Reputation: 370
In ssd_mobilenet_v1_pets.config, line 163 says:
num_steps: 200000
This is the total number of steps the training script will perform, provided you have not changed that value.
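If you want the remaining steps and a rough time estimate without doing the arithmetic by hand, here is a minimal sketch. It assumes the config contains a plain `num_steps: N` line and that the log lines look like the one in the question; the function names `total_steps` and `progress` are my own, not part of the Object Detection API, and the estimate assumes the sec/step rate stays constant.

```python
import re

def total_steps(config_text):
    """Return the first uncommented `num_steps:` value, or None if absent."""
    match = re.search(r"^\s*num_steps:\s*(\d+)", config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

def progress(log_line, num_steps):
    """Return (remaining_steps, estimated_hours) for one training log line.

    Returns None if the line does not look like a 'global step' line.
    """
    match = re.search(
        r"global step (\d+): loss = \S+ \(([\d.]+) sec/step\)", log_line
    )
    if match is None:
        return None
    step = int(match.group(1))
    sec_per_step = float(match.group(2))
    remaining = max(num_steps - step, 0)
    # Rough estimate only: assumes the current rate holds for the rest of the run.
    return remaining, remaining * sec_per_step / 3600.0

if __name__ == "__main__":
    # Path taken from the command in the question.
    with open("training/ssd_mobilenet_v1_pets.config") as f:
        n = total_steps(f.read())
    line = "INFO:tensorflow:global step 14455: loss = 0.5896 (0.775 sec/step)"
    print(n, progress(line, n))
```

With num_steps: 200000 and the log line from the question, that leaves 185545 steps, which is roughly 40 hours at 0.775 sec/step.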
Upvotes: 0
Reputation: 77837
If you're using a predefined model topology, look up the training period (in epochs or steps) in the documentation that comes with the model. If you've built your own model, determine the training period by watching the test results: when the accuracy reaches an acceptable level and then starts to drop, you're likely overtraining. Go back to the high point of accuracy. Repeat this experiment a few times to find the "sweet spot" for your model.
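A minimal sketch of the "back up to the high point" idea, assuming you record a (step, accuracy) pair at each evaluation. The function name `best_checkpoint`, the `patience` parameter, and the numbers in the example are hypothetical and only for illustration; this is not something the training script does for you.

```python
def best_checkpoint(history, patience=3):
    """Find the evaluation with the highest accuracy.

    history: list of (step, accuracy) pairs in evaluation order.
    patience: how many evaluations in a row without a new best
              we tolerate before declaring over-training.
    Returns (best_step, best_accuracy, stopped_early).
    """
    best_step, best_acc = None, float("-inf")
    evals_since_best = 0
    for step, acc in history:
        if acc > best_acc:
            best_step, best_acc = step, acc
            evals_since_best = 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                # Accuracy has not improved for `patience` evaluations.
                return best_step, best_acc, True
    return best_step, best_acc, False

# Made-up evaluation results, not from a real run.
history = [(1000, 0.61), (2000, 0.70), (3000, 0.74),
           (4000, 0.73), (5000, 0.72), (6000, 0.71)]
print(best_checkpoint(history))
```

A larger patience makes it less likely that you stop on a temporary dip, at the cost of training longer past the peak.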
Upvotes: 1