user3586341

Reputation: 67

How to have distributed tensorflow experiment run evaluation on worker instead of master?

I'm using Google Cloud ML Engine to train a model with tensorflow.contrib.learn.Experiment. By default, it seems that TensorFlow has the master server run the evaluations. I only run evals after training is complete (min_eval_frequency=0), and my master has many cores and plenty of RAM but no GPU, so the eval is very slow relative to the P100 workers. Can I make the eval run on a worker?

Upvotes: 0

Views: 282

Answers (1)

rhaertel80

Reputation: 8389

When using learn_runner.run, there is no way to run evaluation on regular workers. Here are a few alternatives:

  1. Use a GPU on your master.
  2. Don't use learn_runner.run. Instead, reproduce its dispatch logic yourself:

Instantiate a RunConfig(), inspect its task_type (and task_id), and invoke Experiment.train, Experiment.evaluate, or Experiment.continuous_eval as appropriate; see the sketch below.
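For concreteness, here is a minimal sketch of that dispatch, assuming the TF 1.x tf.contrib.learn APIs. make_experiment and the output path are placeholders for your own code, and routing evaluation to worker 0 is just one illustrative choice:

    import tensorflow as tf

    def make_experiment(output_dir):
        # Placeholder: build the same Experiment you would normally hand
        # to learn_runner.run (estimator, input_fns, min_eval_frequency=0, ...).
        raise NotImplementedError('build your Experiment here')

    def main(output_dir):
        # RunConfig parses the TF_CONFIG environment variable that
        # Cloud ML Engine sets for every task in the cluster.
        config = tf.contrib.learn.RunConfig()
        experiment = make_experiment(output_dir)

        if config.task_type == 'ps':
            # Parameter servers just join the cluster and serve variables.
            experiment.run_std_server()
        elif config.task_type == 'worker' and config.task_id == 0:
            # Illustrative choice: make the first worker (which has a GPU)
            # the evaluator. continuous_eval() re-runs evaluation as new
            # checkpoints appear in output_dir.
            experiment.continuous_eval()
        else:
            # The master and the remaining workers train.
            experiment.train()

    if __name__ == '__main__':
        main('gs://my-bucket/output')  # hypothetical output path

On Cloud ML Engine, TF_CONFIG is set automatically for every task, so RunConfig picks up task_type and task_id without any extra plumbing.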

That said, since the master is basically just another worker that also happens to do evaluation, is there any reason not to put a GPU on the master? (See the config sketch below.)
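For reference, putting a GPU on the master is just a matter of the job's machine configuration. A hypothetical config.yaml (machine types and counts are illustrative), passed via gcloud ml-engine jobs submit training --config config.yaml, might look like:

    trainingInput:
      scaleTier: CUSTOM
      masterType: standard_p100       # same GPU machine type as the workers
      workerType: standard_p100
      workerCount: 4                  # illustrative count
      parameterServerType: standard
      parameterServerCount: 1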

Upvotes: 1
