Reputation: 865
I have an RNN using a MonitoredTrainingSession for distributed computation. I’m using global_step to identify which batch of input data each worker should take.
I have defined the tensor before creating the session:
global_step_tensor = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
...
minimise = optimiser.minimize(loss, name='adam_opt', global_step=global_step_tensor)
with tf.train.MonitoredTrainingSession(...) as sess:
    graph = tf.get_default_graph()
    curr_step = sess.run(global_step_tensor)
    print(curr_step)  # gives 366
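For reference, the batch selection I described works roughly like this (a simplified sketch; batch_for_step, train_data and batch_size are illustrative names, not my actual code):

def batch_for_step(data, batch_size, step):
    """Return the batch that corresponds to a given global step."""
    n_batches = len(data) // batch_size
    i = step % n_batches  # wrap around at the end of an epoch
    return data[i * batch_size:(i + 1) * batch_size]

# Inside the training loop, each worker does roughly:
# curr_step = sess.run(global_step_tensor)
# batch = batch_for_step(train_data, batch_size, curr_step)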
I thought the variable was only incremented when the optimiser op is evaluated. Why does it start at 366?
Edit: My cluster is defined as one ps task and two workers. Currently, whilst I test, all three are running on the same host on different ports, roughly as in the sketch below.
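A ClusterSpec along these lines (the port numbers are placeholders):

import tensorflow as tf

# One ps task and two workers, all on the same host, on different ports.
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})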
Upvotes: 0
Views: 535
Reputation: 5474
According to the documentation, MonitoredTrainingSession has several default arguments that save checkpoints and summaries automatically:
save_checkpoint_secs: The frequency, in seconds, that a checkpoint is saved using a default checkpoint saver. If save_checkpoint_secs is set to None, then the default checkpoint saver isn't used. Default 600.

save_summaries_steps: The frequency, in number of global steps, that the summaries are written to disk using a default summary saver. If both save_summaries_steps and save_summaries_secs are set to None, then the default summary saver isn't used. Default 100.

save_summaries_secs: The frequency, in seconds, that the summaries are written to disk using a default summary saver. If both save_summaries_steps and save_summaries_secs are set to None, then the default summary saver isn't used. Default not enabled.
That is probably why your step count is not 0 anymore: when MonitoredTrainingSession is created with a checkpoint_dir that already contains a checkpoint, it restores the saved variables, including global_step, so training resumes from the last saved step (366 in your case) rather than being reinitialised to 0.
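If you want to confirm, here is a quick sketch that reads the saved step straight from the checkpoint files (ckpt_dir here is a placeholder for whatever you passed as checkpoint_dir):

import tensorflow as tf

ckpt_dir = '/tmp/train_logs'  # placeholder: use your actual checkpoint_dir

# The most recent checkpoint file, e.g. '/tmp/train_logs/model.ckpt-366'.
latest = tf.train.latest_checkpoint(ckpt_dir)
print(latest)

# Read the saved global_step directly from the checkpoint,
# without building a graph or starting a session.
reader = tf.train.NewCheckpointReader(latest)
print(reader.get_tensor('global_step'))  # should print 366

To start from step 0 again, point checkpoint_dir at an empty directory (or delete the old checkpoints) before training.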
Upvotes: 2