crwarwa

Reputation: 121

TPU core error on Google Cloud Platform (Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker)

The error message is:

Traceback (most recent call last):
  File "./run_classifier.py", line 914, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./run_classifier.py", line 839, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 364, in train
    hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2746, in _convert_train_steps_to_hooks
    if ctx.is_running_on_cpu():
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 442, in is_running_on_cpu
    self._validate_tpu_configuration()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 613, in _validate_tpu_configuration
    'are {}.'.format(tpu_system_metadata.devices))
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are (_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0
, CPU, 268435456, 8731946518164767128), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 14576891077092876257)).

I am trying to run the example code from the paper (bert-gcn-for-paper-citation).

The authors open-sourced the code on GitHub (https://github.com/TeamLab/bert-gcn-for-paper-citation), so I followed the README, ran the code on Google Cloud Platform (GCP), and got the error above.

Before running this code, I verified BERT and GAE on GCP, since the paper and its code build on those models; both ran successfully. (I followed the GCP BERT tutorial: https://cloud.google.com/tpu/docs/tutorials/bert)
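For context, the GCP tutorial expects these environment variables to be exported in the same shell session that launches training (the values below are placeholders, not my actual resource names):

```shell
# Placeholders -- substitute your own TPU, bucket, and task names.
export TPU_NAME=my-tpu
export STORAGE_BUCKET=gs://my-bucket
export TASK_NAME=MRPC

# Sanity check: if this prints an empty value, the --tpu_name flag
# below will also expand to an empty string.
echo "TPU_NAME=${TPU_NAME}"
```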

BERT and BERT-GCN have almost the same flags.

BERT example code

python3 ./bert/run_classifier.py \
--task_name=${TASK_NAME} \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/${TASK_NAME} \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=${STORAGE_BUCKET}/${TASK_NAME}-output/ \
--use_tpu=True \
--tpu_name=$TPU_NAME

and I slightly changed it for BERT-GCN (added the last 3 flags):

python3 ./run_classifier.py \
--task_name=${TASK_NAME} \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/${TASK_NAME} \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=${STORAGE_BUCKET}/${TASK_NAME}-output/ \
--use_tpu=True \
--tpu_name=$TPU_NAME \
--model=bert_base \
--dataset=PeerRead \
--frequency=1

but I got the error above. I googled it, but none of the results solved my problem.

Could you help me out?

Upvotes: 2

Views: 971

Answers (1)

Paddy Popeye

Reputation: 1824

The issue here is that gcloud doesn't pass the value of the $TPU_NAME environment variable to your script. You need to specify your TPU explicitly:

USE:

--tpu_name=my-tpu

INSTEAD OF:

--tpu_name=$TPU_NAME
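A quick way to confirm this is the cause: an unset shell variable expands to an empty string, so `--tpu_name=$TPU_NAME` silently becomes `--tpu_name=`, and the estimator falls back to local devices (only CPU/XLA_CPU, exactly as the error reports). A minimal check you could run in the same environment that launches run_classifier.py (variable name assumed to match your setup):

```python
import os

# If TPU_NAME is unset or empty here, the flag built from it is empty too,
# and TPUEstimator cannot locate any TPU workers.
tpu_name = os.environ.get("TPU_NAME", "")
if not tpu_name:
    print("TPU_NAME is empty -- pass the TPU name literally, e.g. --tpu_name=my-tpu")
else:
    print("TPU_NAME resolved to:", tpu_name)
```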

Upvotes: 1
