u4lr

Reputation: 41

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, the master worker fails to save the checkpoint: saver.save() returns as if it succeeded (no exception is raised, and it returns the path of the stored checkpoint file), but the returned checkpoint file does not exist.
This does not match the description in the TensorFlow API documentation.
Why does this happen, and how can I fix it?

=============
the related code is below:

def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)

=============
the log is below:

2016-06-02 21:33:52,323 root         ERROR    save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root         ERROR    has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
  File "d_rl_main_model_dist_0.py", line 755, in run_worker
    model_a.save(next_model_idx)
  File "d_rl_main_model_dist_0.py", line 360, in save
    Trainer.save(self,save_idx)
  File "d_rl_main_model_dist_0.py", line 289, in save
    raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.

===========
This does not match the TensorFlow API documentation, which defines Saver.save as follows:

https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver

tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)

Returns:

A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.

Raises:

TypeError: If sess is not a Session.

ValueError: If latest_filename contains path components.

Upvotes: 0

Views: 895

Answers (1)

mrry

Reputation: 126184

The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
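To make the placement concrete, here is a minimal sketch of a distributed setup, assuming a TF 1.x-style API and hypothetical host names and paths. Variables are placed on "/job:ps" via replica_device_setter, so the checkpoint data files that Saver.save() produces are written by the ps process, on its local disk, unless save_path points at storage that all jobs share.

# Hedged sketch: cluster spec, hosts, and save_path below are illustrative, not
# taken from the question.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],       # hypothetical parameter server
    "worker": ["worker0.example.com:2222"],   # hypothetical worker
})

# On the worker task: replica_device_setter places variables on /job:ps.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # This variable lives on /job:ps/task:0, not on the worker process.
    v = tf.Variable(tf.zeros([10]), name="v")

saver = tf.train.Saver()
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    # save() returns a path string, but the checkpoint data is written by the
    # ps process, so look for the files on ps0.example.com (or point save_path
    # at a filesystem mounted on every machine).
    path = saver.save(sess, "/shared/models/rl_model", global_step=2)
    print("checkpoint recorded at", path)

In the question's code, os.path.exists(ret) runs on the worker, which is why it reports the file as missing even though save() returned normally.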

Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
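As a follow-up sketch, assuming a TensorFlow build that includes the Google Cloud Storage filesystem support mentioned above, and a hypothetical bucket name, you can make save_path refer to storage that every process sees, so the path returned by save() is valid everywhere:

# Hedged sketch, reusing saver and sess from the example above; the bucket
# name is hypothetical.
save_path = saver.save(sess, "gs://my-bucket/checkpoints/rl_model", global_step=2)

A shared NFS mount used as save_path on all machines achieves the same effect without GCS.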

Upvotes: 1
