muneeb

Reputation: 141

Tensorflow: restore checkpoint variables into distributed setup

I have a saved checkpoint generated by graph code in a regular, non-distributed setup under the constraint with tf.device('/cpu:0'): (to force the model parameters to reside on the CPU instead of the GPU). I then converted the same code/graph to a distributed setting following the guidelines in TF-Inception. Now, when I try to restore the checkpoint in the distributed setup, I get device mismatch errors. Is there a way to override the device requirements saved in the checkpoint file, or something similar? My new distributed code defines the Saver and scopes as:

if FLAGS.job_name == 'worker':
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_id,
            cluster=cluster_spec)):
        # ...same network-graph code... #
        restorer = tf.train.Saver()
        with tf.Session() as sess:
            restorer.restore(sess, 'ResNet-L50.ckpt')

My cluster has one ps and one worker, and both are on localhost. Error line:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
     [[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]

Full error trace:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "dlaunch.py", line 85, in <module>
    tf.app.run()      # (tf.app.flags parsed here)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dlaunch.py", line 81, in main
    dtrainer.train(server.target, cluster_spec)
  File "/home/muneeb/parkingtf/dtrainer.py", line 88, in train
    restorer.restore(sess, 'ResNet-L50.ckpt')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1103, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 328, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 563, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 658, in _do_call
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'save/restore_slice_268/shape_and_slice': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
     [[Node: save/restore_slice_268/shape_and_slice = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: >, _device="/job:ps/task:0/device:CPU:0"]()]]
Caused by op u'save/restore_slice_268/shape_and_slice', defined at:
  File "dlaunch.py", line 85, in <module>
    tf.app.run()      # (tf.app.flags parsed here)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dlaunch.py", line 81, in main
    dtrainer.train(server.target, cluster_spec)
  File "/home/muneeb/parkingtf/dtrainer.py", line 86, in train
    restorer = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 515, in build
    filename_tensor, vars_to_save, restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 271, in _AddRestoreOps
    values = self.restore_op(filename_tensor, vs, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 186, in restore_op
    preferred_shard=preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 201, in _restore_slice
    preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 271, in _restore_slice
    preferred_shard=preferred_shard, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 444, in apply_op
    as_ref=input_arg.is_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 179, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 166, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2162, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
    self._traceback = _extract_stack()

Upvotes: 0

Views: 1914

Answers (1)

mrry

Reputation: 126184

The following line:

with tf.Session() as sess:

...is responsible for the error. Passing no arguments to tf.Session() creates an in-process session that can only use devices on the local machine. To work in distributed mode, you should have something like:

# Assuming you created `server = tf.train.Server(...)` earlier.
with tf.Session(server.target) as sess:

...or, if you are connecting to a different process:

# Assuming your server is in a different process.
with tf.Session("grpc://..."):

Note that the device assignments are not stored in the checkpoint file; they are added by tf.train.replica_device_setter(). Device configuration is a bit tricky right now, and it's something we're working to simplify.
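
For context, here is a minimal sketch of how the worker-side pieces might fit together. The cluster addresses and ports are hypothetical placeholders, the checkpoint name is taken from the question, and the real network-graph code is elided as in the original:

import tensorflow as tf

# Hypothetical single-machine cluster: one ps task and one worker task.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Each process starts a server for its own job/task; this is the worker.
server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0",
        cluster=cluster_spec)):
    # ...same network-graph code (variables are placed on the ps job)...
    restorer = tf.train.Saver()

# Passing server.target lets the session see every device in the cluster,
# so the restore ops pinned to /job:ps/task:0 can be placed and run
# (the ps process must also be running for the restore to succeed).
with tf.Session(server.target) as sess:
    restorer.restore(sess, 'ResNet-L50.ckpt')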

Upvotes: 2
