zxcvvxcz

Reputation: 11

Cannot connect to TPU with ssh on GCP

I was following the tutorial on https://cloud.google.com/tpu/docs/how-to.

I created a TPU instance and tried to connect to it with the gcloud compute ssh command, but got this error:

AppData\Local\Google\Cloud SDK>gcloud compute ssh node-1 --zone=asia-east1-c
ERROR: (gcloud.compute.ssh) Could not fetch resource:
 - The resource 'projects/project-masker/zones/asia-east1-c/instances/node-1' was not found

While troubleshooting, I found that the TPUs exist but are not listed in any execution group:

AppData\Local\Google\Cloud SDK>gcloud compute tpus list
NAME    ZONE          ACCELERATOR_TYPE  NETWORK  RANGE             STATUS
node-2  asia-east1-c  v2-8              default  10.75.202.248/29  READY
node-1  asia-east1-c  v2-8              default  10.82.81.168/29   READY


AppData\Local\Google\Cloud SDK>gcloud compute tpus execution-groups list
Listed 0 items.

This is the output I got when I tried restarting the TPU:

Request issued for: [node-1]
Waiting for operation [projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b] to complete...done.
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.common.OperationMetadata
  apiVersion: v1
  cancelRequested: false
  createTime: '2021-07-03T08:00:49.884674545Z'
  endTime: '2021-07-03T08:01:31.161199334Z'
  target: projects/project-masker/locations/asia-east1-c/nodes/node-1
  verb: update
name: projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b
response:
  '@type': type.googleapis.com/google.cloud.tpu.v1.Node
  acceleratorType: v2-8
  apiVersion: V1
  cidrBlock: 10.82.81.168/29
  createTime: '2021-07-03T07:27:41.148997156Z'
  health: HEALTHY
  ipAddress: 10.82.81.170
  name: projects/project-masker/locations/asia-east1-c/nodes/node-1
  network: global/networks/default
  networkEndpoints:
  - ipAddress: 10.82.81.170
    port: 8470
  port: '8470'
  schedulingConfig: {}
  serviceAccount: [email protected]
  state: READY
  tensorflowVersion: pytorch-1.9

I searched Google for related articles but couldn't find any. How can I fix this?

Upvotes: 1

Views: 1859

Answers (1)

Allen Wang

Reputation: 301

You can't SSH to a TPU node directly, so gcloud compute ssh {tpu_name} isn't expected to work.

You can, however, SSH directly to a TPU VM; please see this link. If you are already using a TPU VM, then the issue is that you're running

gcloud compute ssh

rather than

gcloud alpha compute tpus tpu-vm ssh ...
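
As a rough sketch of what the full command might look like with the names from the question, the snippet below only builds and prints the command line; it does not run gcloud. The helper name tpu_vm_ssh_command is made up for illustration, and the values node-1 and asia-east1-c are taken from the question. It assumes the TPU was created as a TPU VM. The output in the question shows a google.cloud.tpu.v1.Node resource, which suggests a regular TPU node, in which case this command would not apply.

```python
import shlex
from typing import List


def tpu_vm_ssh_command(tpu_name: str, zone: str) -> List[str]:
    """Build the argument list for SSH-ing into a TPU VM.

    Hypothetical helper: it only assembles the command from the answer
    above and does not check that the TPU is actually a TPU VM.
    """
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "ssh",
        tpu_name,
        f"--zone={zone}",
    ]


# Values taken from the question; replace them with your own TPU name and zone.
cmd = tpu_vm_ssh_command("node-1", "asia-east1-c")

# Print the command so it can be reviewed and pasted into a terminal.
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; to connect for real, paste the printed line into a terminal where the Cloud SDK is installed.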

Upvotes: 3
