alizaidi

Reputation: 61

TPU Based Tuning for CloudML

Are TPUs supported for distributed hyperparameter search? I'm using the tensor2tensor library, which supports hyperparameter search on Cloud ML Engine. For example, the following works for me to tune a language model on GPUs:

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=languagemodel_lm1b8k_packed \
  --train_steps=100000 \
  --eval_steps=8 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --cloud_mlengine \
  --hparams_range=transformer_base_range \
  --autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
  --autotune_maximize \
  --autotune_max_trials=100 \
  --autotune_parallel_trials=3

However, when I try to use TPUs instead, as in the following:

t2t-trainer \
  --problem=languagemodel_lm1b8k_packed \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --train_steps=100000 \
  --use_tpu=True \
  --cloud_mlengine_master_type=cloud_tpu \
  --cloud_mlengine \
  --hparams_range=transformer_base_range \
  --autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
  --autotune_maximize \
  --autotune_max_trials=100 \
  --autotune_parallel_trials=5

I get the error:

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/******/jobs?alt=json returned "Field: master_type Error: The specified machine type for masteris not supported in TPU training jobs: cloud_tpu"

Upvotes: 0

Views: 196

Answers (2)

Ryan Sepassi

Reputation: 1501

One of the authors of the tensor2tensor library here. Yup, this was indeed a bug, and it is now fixed. Thanks for spotting it. We'll release a fixed version on PyPI this week; until then, you can of course clone the repo and install locally from master.

The command you used should work just fine now.

Upvotes: 3

Rajiv Bharadwaja

Reputation: 100

I believe there is a bug in the tensor2tensor library: https://github.com/tensorflow/tensor2tensor/blob/6a7ef7f79f56fdcb1b16ae76d7e61cb09033dc4f/tensor2tensor/utils/cloud_mlengine.py#L281

It's the worker_type (not the master_type) that needs to be set to cloud_tpu for Cloud ML Engine TPU jobs.
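To make the distinction concrete, here is a minimal sketch of what the corrected job request might look like. The field names follow the Cloud ML Engine v1 TrainingInput schema; the specific values (region, machine type) are illustrative, not taken from the question:

```python
# Illustrative Cloud ML Engine v1 TrainingInput payload for a TPU job,
# reflecting the fix: the TPU is requested via workerType, while the
# master remains an ordinary VM.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "standard",   # master must be a regular machine type
    "workerType": "cloud_tpu",  # the TPU is attached as the worker
    "workerCount": 1,
    "region": "us-central1",    # example value
}

# Setting masterType to "cloud_tpu" instead is what triggers the
# HttpError 400 quoted in the question.
```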

To answer the original question, though: yes, hyperparameter tuning should be supported for TPUs; the error above is orthogonal to that.

Upvotes: 2
