Jon Deaton

Reputation: 4379

Restore TensorFlow model on different machine

I trained a TensorFlow model on a GPU cluster and saved it using

saver = tf.train.Saver()
saver.save(sess, config.model_file, global_step=global_step)

and now I am trying to restore the model with

saver = tf.train.import_meta_graph('model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint(save_path))

for evaluation, on a different system. The issue is that saver.restore yields the following error:

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 205, in <module>
    main()
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 162, in main
    restore_and_evaluate(save_path, model_file, output_dir)
  File "/Users/jonpdeaton/Developer/BraTS18-Project/segmentation/evaluate.py", line 127, in restore_and_evaluate
    saver.restore(sess, tf.train.latest_checkpoint(save_path))
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1857, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files
    for single_filename in filename
  File "/Users/jonpdeaton/anaconda3/envs/BraTS/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /afs/cs.stanford.edu/u/jdeaton/dfs/unet; No such file or directory

It seems that some paths from the system the model was trained on were stored in the model or checkpoint file, and that they are no longer valid on the system I am running evaluation on. How do I restore a model (for evaluation) on a different machine after having copied the model-X.meta, model-X.index and checkpoint files?

Upvotes: 0

Views: 513

Answers (2)

Siyuan Ren

Reputation: 7844

By default, the Saver object will write the absolute model checkpoint paths into the checkpoint file. So the path returned by tf.train.latest_checkpoint(save_path) is the absolute path on your old machine.

Temporary solutions (either one works):

  1. Pass the actual model file path directly to the restore method rather than the result of tf.train.latest_checkpoint.
  2. Manually edit the checkpoint file, which is a simple text file.
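A minimal sketch of option 1, assuming the copied files sit in save_path and are named model-1000.* as in the question. The helper names local_checkpoint_prefix and restore_for_evaluation are made up for illustration, and the TensorFlow calls are the same TF 1.x ones used in the question:

```python
import os


def local_checkpoint_prefix(save_path, name="model", step=1000):
    """Build the checkpoint prefix as it exists on *this* machine.

    The defaults ("model", 1000) are assumptions taken from the
    'model-1000.meta' file name in the question.
    """
    return os.path.join(save_path, "{}-{}".format(name, step))


def restore_for_evaluation(sess, save_path):
    """Restore without consulting the 'checkpoint' file at all."""
    # Imported lazily so the path helper above works without TensorFlow.
    import tensorflow as tf

    prefix = local_checkpoint_prefix(save_path)
    # Rebuild the graph from the .meta file, then load the variables
    # from the prefix directly instead of tf.train.latest_checkpoint().
    saver = tf.train.import_meta_graph(prefix + ".meta")
    saver.restore(sess, prefix)
    return saver
```

Because the prefix is built from the local directory, the stale absolute path recorded on the training machine is never read.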

Long-term solution: create the Saver with relative paths on the machine where you save the model:

saver = tf.train.Saver(save_relative_paths=True)

Upvotes: 1

Jon Deaton

Reputation: 4379

Open up the checkpoint file with your favorite text editor and simply change the absolute paths found therein to just filenames.
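If you would rather script that edit, here is a rough sketch. It assumes the checkpoint file uses the usual text layout, with quoted model_checkpoint_path and all_model_checkpoint_paths entries; the function name strip_checkpoint_dirs is made up for illustration:

```python
import posixpath
import re
import os

# Keys in the 'checkpoint' text file whose values are checkpoint paths.
_PATH_KEYS = {"model_checkpoint_path", "all_model_checkpoint_paths"}
# Matches lines of the form:  key: "value"
_ENTRY = re.compile(r'^\s*(\w+)\s*:\s*"(.*)"\s*$')


def strip_checkpoint_dirs(checkpoint_file):
    """Rewrite absolute checkpoint paths in place so only file names remain."""
    with open(checkpoint_file) as f:
        lines = f.read().splitlines()

    fixed = []
    for line in lines:
        match = _ENTRY.match(line)
        if match and match.group(1) in _PATH_KEYS:
            # posixpath is used because the stored paths come from a
            # Unix-like training machine, whatever the local OS is.
            name = posixpath.basename(match.group(2))
            line = '{}: "{}"'.format(match.group(1), name)
        fixed.append(line)  # any other line is kept unchanged

    with open(checkpoint_file, "w") as f:
        f.write("\n".join(fixed) + "\n")
    return fixed
```

Call it once on the copied file, e.g. strip_checkpoint_dirs(os.path.join(save_path, "checkpoint")), before calling tf.train.latest_checkpoint(save_path). Keep a backup of the original file in case its layout differs from what is assumed here.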

Upvotes: 0
