Reputation: 1317
I am trying to follow this tutorial (https://www.tensorflow.org/guide/tpu), but with my own provisioned TPU. When I change the code to point at my TPU, the TPU cluster cannot be resolved, i.e., I get a timeout when I run this cell:
import tensorflow as tf

tpu_addr = f"{MY_TPU_IP}:8470"  # os.environ['COLAB_TPU_ADDR'], if running Colab's TPU
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=f'grpc://{tpu_addr}')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
I take MY_TPU_IP from the TPU I created at https://console.cloud.google.com/compute/tpus.
from google.colab import auth
auth.authenticate_user()
But the answer does not address this problem: it points to a code example that presupposes a solution to my problem (how to resolve the TPU cluster IP). Moreover, the answer is for TensorFlow v1 (which I could port to v2 myself if it solved my problem in the first place).
Upvotes: 0
Views: 514
Reputation: 1317
I found out what the problem was. The IP shown in the TPU dashboard is an "Internal IP." The only way to reach it is through a jump server in the same data center as the TPU, i.e., SSH tunneling.
Here is an article describing the whole process in detail: Train neural networks faster with tpu from your laptop
But the idea is:
Create a TPU
Create a Jump Server
Create a Storage Bucket
Run this command to create the SSH tunnel (where $JUMP is the jump server's address and $TPU is the TPU's internal IP): ssh $USER@$JUMP -L 2000:$TPU:8470
Train your model.
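Once the tunnel is up, the TPU's gRPC port 8470 is reachable on localhost:2000 (the local port chosen in the ssh -L flag above; any free port works as long as both places agree). A minimal sketch of pointing the resolver at the tunnel instead of the internal IP — the helper name is mine, and the TensorFlow calls are the same ones from the question, shown commented out because they need a live tunnel and TPU to run:

```python
def tunneled_tpu_address(local_port: int = 2000) -> str:
    """Return the gRPC endpoint TensorFlow should dial once the SSH tunnel is up.

    local_port must match the first number in `ssh ... -L 2000:$TPU:8470`.
    """
    return f"grpc://localhost:{local_port}"

# With the tunnel running, the connection code from the question works
# unchanged, except the resolver dials the local end of the tunnel:
#
#   import tensorflow as tf
#   resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
#       tpu=tunneled_tpu_address())
#   tf.config.experimental_connect_to_cluster(resolver)
#   tf.tpu.experimental.initialize_tpu_system(resolver)
```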
Upvotes: 1