Reputation: 13
I am trying to use ml-engine to tune some hyperparameters of a custom model. The model runs fine when I run on a single instance (e.g., standard_gpu or complex_model_m_gpu), but fails when I try to run the same job on a cluster of GPU-enabled machines. I am following the instructions for the CUSTOM tier using a config.yaml file, as described here. Adding this config file to the submission is the only change. Is there something else I need to do to run a distributed job?
I am submitting the job like this:
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.10 \
--python-version 3.5 \
--module-name module.run_task \
--package-path module/ \
--region $REGION \
--config hptuning_config.yaml \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA
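For reference, a CUSTOM-tier hptuning_config.yaml along these lines (the machine types, worker counts, metric tag, and hyperparameter name below are illustrative assumptions, not taken from the original job):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu       # illustrative machine types
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 2
  parameterServerCount: 1
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 10
    maxParallelTrials: 2
    hyperparameterMetricTag: accuracy   # hypothetical metric name
    params:
      - parameterName: learning-rate    # hypothetical hyperparameter
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
```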
My setup.py file requires tensorflow-probability 0.3.0 (the model breaks if I use 0.4.0).
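A setup.py pinning that dependency might look like the following (the package name matches the --package-path in the submission command; the version and other metadata are illustrative):

```python
from setuptools import find_packages, setup

setup(
    name="module",    # matches the package submitted with --package-path
    version="0.1",    # illustrative
    packages=find_packages(),
    # 0.4.0 breaks the model, so pin the exact working release
    install_requires=["tensorflow-probability==0.3.0"],
)
```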
The error (seen on all workers) is pasted below. Any help appreciated!
worker-replica-0 Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 74, in <module>
    train_and_evaluate(hparams)
  File "/root/.local/lib/python3.5/site-packages/module/run_task.py", line 42, in train_and_evaluate
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 617, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 627, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 747, in _start_distributed_training
    self._start_std_server(config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 735, in _start_std_server
    start=False)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server
Upvotes: 0
Views: 746
Reputation: 121
This error occurs because you are calling tf.estimator.train_and_evaluate twice in succession. When the second call is made, not all of the gRPC servers started by the first call have shut down, so the second call tries to open new servers on ports that are still in use.
Running multiple distributed jobs in succession within a single process is not supported in TensorFlow; the parameter servers in particular block until the process is killed. You'll need to refactor your code so that it makes only a single call to tf.estimator.train_and_evaluate per job.
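To illustrate the constraint, here is a minimal sketch of restructuring the entry point so the distributed launch happens exactly once per process. The TF-specific piece is stubbed out with a hypothetical placeholder (launch_distributed_training); in the real module, that placeholder is the single tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec) call, and any looping over epochs or eval rounds belongs inside the train/eval specs, not around the call itself.

```python
# Guard illustrating "train_and_evaluate must run exactly once per job".
_already_launched = False

def launch_distributed_training(hparams):
    """Placeholder for the one-and-only tf.estimator.train_and_evaluate call."""
    global _already_launched
    if _already_launched:
        # A second call would try to start gRPC servers on ports still bound
        # by the first call -- the "Could not start gRPC server" error above.
        raise RuntimeError("train_and_evaluate must be called once per job")
    _already_launched = True
    return "training started"

def main(hparams):
    # All repetition (epochs, periodic eval, etc.) is configured inside the
    # specs passed to train_and_evaluate, not by invoking it repeatedly.
    return launch_distributed_training(hparams)
```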
Upvotes: 0