Reputation: 69
I created a Google TPU virtual machine for training my models.
Are there tools like nvidia-smi
that can show TPU usage in the CLI?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:1B:00.0 Off | 0 |
| N/A 71C P0 193W / 250W | 31128MiB / 32768MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:3D:00.0 Off | N/A |
| 51% 82C P2 217W / 250W | 10671MiB / 11264MiB | 97% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:3E:00.0 Off | N/A |
|134% 79C P2 200W / 250W | 8015MiB / 11264MiB | 82% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:88:00.0 Off | 0 |
| N/A 45C P0 28W / 250W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A |
| 36% 60C P2 103W / 250W | 9475MiB / 11264MiB | 25% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... On | 00000000:B2:00.0 Off | N/A |
|135% 84C P2 230W / 250W | 8015MiB / 11264MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The output above is what nvidia-smi shows for GPUs; I would like something similar that reports the memory and utilization of TPUs.
I read the TPU user guide and found nothing like this.
Besides, capture_tpu_profile --tpu=v2-8 --monitoring_level=2 --tpu_zone=<my_zone> --gcp_project <my_project_id>
fails in the VM:
2023-04-11 13:34:55.250537: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
[percpu.cc : 539] RAW: rseq syscall failed with errno 22 after membarrier sycall succeeded.
TensorFlow version 2.8.0 detected
Welcome to the Cloud TPU Profiler v2.4.0
I0411 13:35:28.692314 140515203357760 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0411 13:35:28.908957 140515203357760 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/tpu-cloud-381912/locations/us-central1-f/nodes/v2-8?alt=json
I0411 13:35:28.909139 140515203357760 transport.py:157] Attempting refresh to obtain initial access_token
Failed to find TPU <vm> in zone us-central1-f project <my_project_id>. You may use --tpu_zone and --gcp_project to specify the zone and project of your TPU.
Upvotes: 0
Views: 837
Reputation: 11
There's a new CLI, tpu-info,
for checking basic TPU utilization metrics. You can install it via pip with:
pip install git+https://github.com/google/cloud-accelerator-diagnostics/#subdirectory=tpu_info
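Once installed, you run it directly on the TPU VM. A minimal sketch (the exact columns and layout depend on the tpu-info version and the TPU type attached to the VM):

```shell
# Run on the TPU VM itself; prints per-chip utilization tables
# (HBM memory usage and duty cycle) for the attached TPU chips.
tpu-info
```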
Upvotes: 1
Reputation: 191
There is no Google-supported official implementation of nvidia-smi
for TPUs; however, there is open-source code available that you can search for and use.
From the error message, it looks like the zone and/or project name aren't specified correctly. Can you clarify how exactly you are running capture_tpu_profile
? Did you replace <my_zone>
and <my_project_id>
with the values that you used to create your TPU VM? It looks like it picked up the zone as us-central1-f but not the project ID.
capture_tpu_profile --tpu=v2-8 --monitoring_level=2 --tpu_zone=<my_zone> --gcp_project <my_project_id>
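For example, with concrete values substituted (us-central1-f is the zone from your error log; the project ID below is a placeholder you must replace with your own):

```shell
# Zone taken from the error log; "my-gcp-project" is a placeholder --
# substitute the project ID you used when creating the TPU VM.
capture_tpu_profile --tpu=v2-8 --monitoring_level=2 \
  --tpu_zone=us-central1-f --gcp_project=my-gcp-project
```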
Update: you might also find this tool helpful: https://cloud.google.com/tpu/docs/profile-tpu-vm#memory_viewer
Upvotes: 0