Reputation: 865
I have an RNN using a MonitoredTrainingSession for distributed computation. I’m using global_step to identify which batch of input data each worker should take.
I have defined the tensor before creating the session:
global_step_tensor = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
...
minimise = optimiser.minimize(loss, name='adam_opt', global_step=global_step_tensor)
with tf.train.MonitoredTrainingSession(...) as sess:
    graph = tf.get_default_graph()
    curr_step = sess.run(global_step_tensor)
    print(curr_step)  # gives 366
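For reference, the batch selection I described works roughly like this (a simplified sketch; batch_for_step, train_data and batch_size are illustrative names, not my actual code):

def batch_for_step(data, batch_size, step):
    """Return the batch that corresponds to a given global step."""
    n_batches = len(data) // batch_size
    i = step % n_batches  # wrap around at the end of an epoch
    return data[i * batch_size:(i + 1) * batch_size]

# Inside the training loop, each worker does roughly:
# curr_step = sess.run(global_step_tensor)
# batch = batch_for_step(train_data, batch_size, curr_step)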
I thought the variable was only incremented when the optimiser op is evaluated. Why does it start at 366?
Edit: My cluster is defined as one ps task and two workers. Currently, whilst I test, all three are running on the same host on different ports, roughly as in the sketch below.
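A ClusterSpec along these lines (the port numbers are placeholders):

import tensorflow as tf

# One ps task and two workers, all on the same host, on different ports.
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})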
Upvotes: 0
Views: 535
Reputation: 5474
According to the documentation, MonitoredTrainingSession has several default arguments that save checkpoints and summaries automatically:
save_checkpoint_secs: The frequency, in seconds, that a checkpoint is saved using a default checkpoint saver. If save_checkpoint_secs is set to None, then the default checkpoint saver isn't used. Default 600.

save_summaries_steps: The frequency, in number of global steps, that the summaries are written to disk using a default summary saver. If both save_summaries_steps and save_summaries_secs are set to None, then the default summary saver isn't used. Default 100.

save_summaries_secs: The frequency, in seconds, that the summaries are written to disk using a default summary saver. If both save_summaries_steps and save_summaries_secs are set to None, then the default summary saver isn't used. Default not enabled.
That is probably why your step count is not 0 anymore: when MonitoredTrainingSession is created with a checkpoint_dir that already contains a checkpoint, it restores the saved variables, including global_step, so training resumes from the last saved step (366 in your case) rather than being reinitialised to 0.
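If you want to confirm, here is a quick sketch that reads the saved step straight from the checkpoint files (ckpt_dir here is a placeholder for whatever you passed as checkpoint_dir):

import tensorflow as tf

ckpt_dir = '/tmp/train_logs'  # placeholder: use your actual checkpoint_dir

# The most recent checkpoint file, e.g. '/tmp/train_logs/model.ckpt-366'.
latest = tf.train.latest_checkpoint(ckpt_dir)
print(latest)

# Read the saved global_step directly from the checkpoint,
# without building a graph or starting a session.
reader = tf.train.NewCheckpointReader(latest)
print(reader.get_tensor('global_step'))  # should print 366

To start from step 0 again, point checkpoint_dir at an empty directory (or delete the old checkpoints) before training.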
Upvotes: 2