J.C.

Reputation: 69

TPU VM: show TPU device usage in the CLI

I created a Google TPU virtual machine for training my models. Is there a tool like nvidia-smi that shows TPU usage in the CLI?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   71C    P0   193W / 250W |  31128MiB / 32768MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 51%   82C    P2   217W / 250W |  10671MiB / 11264MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:3E:00.0 Off |                  N/A |
|134%   79C    P2   200W / 250W |   8015MiB / 11264MiB |     82%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   45C    P0    28W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
| 36%   60C    P2   103W / 250W |   9475MiB / 11264MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:B2:00.0 Off |                  N/A |
|135%   84C    P2   230W / 250W |   8015MiB / 11264MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I'm hoping for output like the above, which would tell me the memory and utilization of each TPU.

I read the TPU user guide and found nothing like this.

Besides, capture_tpu_profile --tpu=v2-8 --monitoring_level=2 --tpu_zone=<my_zone> --gcp_project <my_project_id> fails when run on the VM:

2023-04-11 13:34:55.250537: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
[percpu.cc : 539] RAW: rseq syscall failed with errno 22 after membarrier sycall succeeded.
TensorFlow version 2.8.0 detected
Welcome to the Cloud TPU Profiler v2.4.0
I0411 13:35:28.692314 140515203357760 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0411 13:35:28.908957 140515203357760 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/tpu-cloud-381912/locations/us-central1-f/nodes/v2-8?alt=json
I0411 13:35:28.909139 140515203357760 transport.py:157] Attempting refresh to obtain initial access_token
Failed to find TPU <vm> in zone us-central1-f project <my_project_id>. You may use --tpu_zone and --gcp_project to specify the zone and project of your TPU.

Upvotes: 0

Views: 837

Answers (2)

Ben Bastian

Reputation: 11

There's a new CLI tool, tpu-info, for checking basic TPU utilization metrics. You can install it via pip with:

pip install git+https://github.com/google/cloud-accelerator-diagnostics/#subdirectory=tpu_info
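Once it is installed, a minimal sketch of how it might be used, assuming the package puts a `tpu-info` command on your PATH. The helper name `show_tpu_usage` is hypothetical and only there to keep the check in one place:

```shell
#!/usr/bin/env bash
# show_tpu_usage is a hypothetical helper name, not part of tpu-info.
show_tpu_usage() {
  if command -v tpu-info >/dev/null 2>&1; then
    # Assumes the install put a `tpu-info` command on PATH.
    tpu-info || echo "tpu-info exited with an error"
  else
    echo "tpu-info not found on PATH; run the pip install above first"
  fi
}

show_tpu_usage
# For a periodically refreshing view, similar to `watch nvidia-smi`:
#   watch -n 1 tpu-info
```

The metrics may only be populated while a workload is actually running on the TPU, so an idle VM might not show much.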

Upvotes: 1

Susie Sargsyan

Reputation: 191

There is no official, Google-supported equivalent of nvidia-smi for TPUs; however, there are open-source tools available that you can search for and use.

From the error message, it looks like the zone and/or project name isn't specified correctly. Can you clarify exactly how you are running capture_tpu_profile? Did you replace <my_zone> and <my_project_id> with the values you used to create your TPU VM? It looks like it picked up the zone as us-central1-f but not the project ID.

capture_tpu_profile --tpu=v2-8 --monitoring_level=2 --tpu_zone=<my_zone> --gcp_project <my_project_id>
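One more thing to check: the log in the question shows the profiler requesting .../nodes/v2-8, which suggests the --tpu value is looked up as the TPU's name, while v2-8 looks like an accelerator type. Here is a rough sketch of filling in the flags, assuming that reading is right. The values of TPU_NAME, TPU_ZONE and GCP_PROJECT are hypothetical placeholders, and `profile_cmd` is just an illustrative helper:

```shell
#!/usr/bin/env bash
# Hypothetical values: replace them with the name, zone, and project
# you used when creating the TPU VM.
# The TPU names in a zone can be listed with:
#   gcloud compute tpus tpu-vm list --zone=us-central1-f
TPU_NAME="my-tpu-vm"
TPU_ZONE="us-central1-f"
GCP_PROJECT="my-project-id"

# Print the full command so it can be inspected before running it.
profile_cmd() {
  printf 'capture_tpu_profile --tpu=%s --monitoring_level=2 --tpu_zone=%s --gcp_project=%s\n' \
    "$TPU_NAME" "$TPU_ZONE" "$GCP_PROJECT"
}

profile_cmd
# To actually run it on the VM: eval "$(profile_cmd)"
```

Printing the command first makes it easy to spot a leftover <my_zone> or <my_project_id> placeholder before the profiler tries to look up the TPU.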

Update: you might also find the memory viewer tool helpful: https://cloud.google.com/tpu/docs/profile-tpu-vm#memory_viewer

Upvotes: 0
