Reputation: 939
So I am looking at this example from Google, and they make use of a MonitoredSession, which seems like a really convenient class for saving summaries every n steps. According to the doc, the following snippet:
with tf.train.MonitoredTrainingSession(master=target,
                                       is_chief=is_chief,
                                       checkpoint_dir=job_dir,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=20) as session:
    while True:
        # do training
should save my summaries every 20 steps. And it almost does; however, sometimes my summaries are not saved, and this is a real problem.
Internally, the MonitoredSession creates a SummarySaverHook, and I would expect its before_run / after_run callbacks to be called once every n global steps. That does seem to be the case.
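For reference, that internal hook can also be built by hand with tf.train.SummarySaverHook; a minimal sketch (the job_dir path and the scalar summary here are stand-ins for whatever the real training code defines):

import tensorflow as tf

job_dir = '/tmp/train_logs'  # stand-in for the job_dir used above
tf.summary.scalar('example', tf.constant(0.0))  # assumes some summary is registered

# Roughly what MonitoredTrainingSession sets up when
# save_summaries_steps=20 is passed.
summary_hook = tf.train.SummarySaverHook(
    save_steps=20,                      # fire once every 20 steps
    output_dir=job_dir,                 # write events next to the checkpoints
    summary_op=tf.summary.merge_all())  # all summaries registered in the graph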
What I have noticed is that the callbacks are not being called from the same thread, so I assume this could be a source of the issue, but really I have no idea what is going on; it is very difficult to debug.
I am sorry for the lack of clarity in my question, but I really have trouble understanding what is going on. Has anyone ever been in a similar situation, or does anyone know where this is coming from?
Thank you
Upvotes: 0
Views: 279
Reputation: 3251
Did you try to use the hooks argument while using MonitoredTrainingSession?
with tf.train.MonitoredTrainingSession(master=target, hooks=[<your hooks>],
                                       is_chief=is_chief,
                                       checkpoint_dir=job_dir,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=20) as session:
    while True:
        # do training
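For example, you could construct the summary hook explicitly and pass it in; one option is to also set save_summaries_steps=None so that only your explicit hook runs and there is a single SummarySaverHook to trace. A minimal self-contained sketch, assuming a toy variable and loss (weight, loss, and the job_dir path below are placeholders for your own training code):

import tensorflow as tf

job_dir = '/tmp/train_logs'  # placeholder output directory

# Toy model so the graph has something to train and summarize.
global_step = tf.train.get_or_create_global_step()
weight = tf.Variable(5.0, name='weight')
loss = tf.square(weight)
tf.summary.scalar('loss', loss)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

# Explicit hook: saves merged summaries every 20 global steps.
summary_hook = tf.train.SummarySaverHook(
    save_steps=20,
    output_dir=job_dir,
    summary_op=tf.summary.merge_all())

with tf.train.MonitoredTrainingSession(
        checkpoint_dir=job_dir,
        save_checkpoint_secs=None,
        save_summaries_steps=None,  # disable the implicit summary hook
        hooks=[summary_hook,
               tf.train.StopAtStepHook(last_step=100)]) as session:
    while not session.should_stop():
        session.run(train_op)  # the hook's before_run/after_run wrap this call

With the implicit saver disabled there is exactly one SummarySaverHook in play, which makes it easier to see when its callbacks actually fire and whether the global step is advancing as expected.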
Upvotes: 1