Moving away from tf.contrib.learn: distributed training with dedicated evaluator process

Question

In TF 1.8's upcoming release, tf.contrib.learn.* will be deprecated. The tf.contrib.learn.Experiment class recommends switching to tf.estimator.train_and_evaluate instead, so I'm trying to port my code to that framework.

What I want to do is set up distributed training on two machines' GPUs, plus a third CPU-only process that does continuous evaluation on a small validation set.

Following the examples in the documentation of train_and_evaluate and the Distributed Tensorflow guide, I managed to set up the training half of my desired architecture, but I can't find a way to set up an evaluator.

So far, what I have looks as follows:

import os

import tensorflow as tf

def input_fn(mode, num_classes, batch_size):
  # [...] build input pipeline
  return {'input': images}, labels

def model_fn(features, labels, num_classes, mode):
  # [...] build model
  return tf.estimator.EstimatorSpec(
    mode=mode,
    predictions=predictions,
    loss=total_loss,
    train_op=train_op,
    eval_metric_ops=metrics,
    export_outputs=export_outputs)

def distributed_main_v2(unused_argv):
  """Expects `unused_argv` to be [<program name>, <task type>, <task index (optional)>]."""
  import json
  # Set up environment variables according to the parameters passed to the process
  TF_CONFIG = {
    'cluster': {
        "ps": [
            "host1:2222",
        ],
        "chief": [
            "host1:2223",
            ],
        "worker": [
            "host2:2224"
            ]
    },
    'environment': 'cluster',    
    'task': {
        'type': unused_argv[1].strip(),
        'index': int(unused_argv[2]) if len(unused_argv) > 2 else 0  # task index as an integer
        }
  }
  os.environ['TF_CONFIG'] = json.dumps(TF_CONFIG)
  if unused_argv[1].strip() not in ['worker', 'chief']:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # leave the GPU to the worker process

  # create the estimator
  # define warm start configuration
  regex = '^(?!.*final_layer*|.*aux_logits*)'
  ws_settings = tf.estimator.WarmStartSettings('checkpoint_path', regex)

  # fix for cuDNN fatal memory error with tf.contrib.learn.Experiment (TODO: still necessary?)
  gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.95)
  sess_conf = tf.ConfigProto(gpu_options=gpu_opts)
  run_conf = tf.estimator.RunConfig(session_config=sess_conf)

  # Create the Estimator
  estimator = tf.estimator.Estimator(
    model_fn=lambda features, labels, mode: model_fn(features, labels, NUM_CLASSES, mode),
    model_dir=model_dir,
    config=run_conf,
    warm_start_from=ws_settings)

  # Set up input functions for training and evaluation
  train_input_fn = lambda : input_fn(tf.estimator.ModeKeys.TRAIN, NUM_CLASSES, batch_size)
  eval_input_fn = lambda : input_fn(tf.estimator.ModeKeys.EVAL, NUM_CLASSES, batch_size)

  train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=steps)
  eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

  # start distributed training
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

if __name__ == '__main__':
  # set up globals and parse known arguments
  distributed_main_v2(unused_argv)

This code works, although my understanding of it is still pretty limited. I get what the PS and workers do, but from the specification of chief I understand this should be the "master" worker that also logs summaries and saves checkpoints. What I'm missing now is the periodic evaluation... and I'm at a loss. From the train_and_evaluate codebase I see there's some "evaluator" support but I don't understand how to set it up properly.

GPhilo · Accepted Answer

Note: While writing the question I eventually realised my mistake (i.e., overlooking code and documentation that I must have read at least 20 times by now), but I believe the question and matching answer might be useful to others, so I decided to finish the question and self-answer it.

Had I read the docs in full, I would have noticed the following:

Example of TF_CONFIG for evaluator task. Evaluator is a special task that is not part of the training cluster. There could be only one. It is used for model evaluation.

# This should be a JSON string, which is set as environment variable. Usually
# the cluster manager handles that.
TF_CONFIG='{
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps": ["host4:2222", "host5:2222"]
    },
    "task": {"type": "evaluator", "index": 0}
}'

As it turns out, yes, there is indeed support for the evaluation task and using it is a lot easier than I expected.

Just set the "task" part of TF_CONFIG to {"type": "evaluator", "index": 0} as shown above, leaving the "cluster" part unchanged, and that process will run evaluation. The confusing part for me was "Evaluator is a special task that is not part of the training cluster". I believe this is because the chief worker waits for all workers to register with it when starting the distributed session, so leaving the evaluator out of the cluster keeps training and evaluation independent of each other and makes training agnostic of evaluation.
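As a minimal sketch of how this could fit into the launch code from the question, the snippet below builds TF_CONFIG for any role, including the evaluator. The helper names `build_tf_config` and `configure_process` are hypothetical (they are not part of TensorFlow), the host addresses are the ones from the question, and hiding the GPU from non-training tasks mirrors the CUDA_VISIBLE_DEVICES trick used there. It only prepares environment variables and assumes they are set before the Estimator is created.

```python
import json
import os

# Training cluster copied from the question. The evaluator is deliberately
# absent: it is not part of the training cluster.
CLUSTER = {
    "ps": ["host1:2222"],
    "chief": ["host1:2223"],
    "worker": ["host2:2224"],
}

# Task types that should keep access to the GPU (assumption from the question).
GPU_TASKS = ("chief", "worker")


def build_tf_config(task_type, task_index=0):
    """Return the TF_CONFIG JSON string for one process (hypothetical helper)."""
    if task_type == "evaluator":
        # The docs quoted above say there can be only one evaluator.
        if task_index != 0:
            raise ValueError("only one evaluator is supported")
    else:
        # Training tasks must refer to an address listed in the cluster.
        if task_type not in CLUSTER:
            raise ValueError("unknown task type: %r" % task_type)
        if not 0 <= task_index < len(CLUSTER[task_type]):
            raise ValueError("no %s with index %d" % (task_type, task_index))
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


def configure_process(task_type, task_index=0, environ=os.environ):
    """Set the environment variables for this process (hypothetical helper)."""
    environ["TF_CONFIG"] = build_tf_config(task_type, task_index)
    if task_type not in GPU_TASKS:
        # Keep ps and evaluator processes on the CPU.
        environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return environ
```

With something like this, the third process would call `configure_process("evaluator")` and then run the same `tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)` call as the training processes.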
