Doc

Reputation: 385

Get the number of free GPUs on a SLURM Cluster

I am scheduling jobs on a cluster that take up either 1 or 2 GPUs on some nodes. I frequently use sinfo -p gpu to list all nodes of the 'gpu' partition as well as their state. Some appear with the state 'idle', indicating that there is no job running on them. Some, however, appear with the state 'mix', meaning that some job is running on them.

However, there is no information about how many GPUs on a mixed-state node are actually taken. Is there any command, possibly sinfo-based, that tells me the number of free GPUs on the cluster, ideally per node?

The sinfo manual did not give any insights except the output option "%G", which just shows the number of GPUs available in general. Thanks!
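For example, something like the following only shows me the GRES that are configured on each node ("%N" is the node name, "%T" the state, "%G" the generic resources), not how many of them are currently in use:

sinfo -p gpu -o "%N %.10T %G"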

Update: I realized that I can use "%C" to print the allocated/idle CPU count per node with the following format option:

--format="%9P %l %10n %.14C %.10T "
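Spelled out as a full command (assuming the same 'gpu' partition as above; the "%C" column shows CPUs as allocated/idle/other/total per node), that is:

sinfo -p gpu --format="%9P %l %10n %.14C %.10T"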

I want to do the exact same thing but with GPUs instead of CPUs.

Upvotes: 3

Views: 2784

Answers (1)

damienfrancois

Reputation: 59250

Unfortunately sinfo does not provide the information right away. You will have to parse the output of scontrol:

scontrol -o show node | grep -Po "AllocTRES[^ ]*(?<=gpu=)\K[0-9]+" | paste -d + -s | bc

This lists all nodes, extracts from each line the part that corresponds to AllocTRES (allocated trackable resources, of which GPUs are a part), and within that part, the value that concerns the GPUs. It then uses paste and bc to compute the sum (you could use awk instead if you prefer).
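For instance, the same total with awk instead of paste and bc (just a matter of taste, the extraction part is identical):

scontrol -o show node | grep -Po "AllocTRES[^ ]*(?<=gpu=)\K[0-9]+" | awk '{s += $1} END {print s+0}'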

If you replace Alloc with Cfg in the one-liner, you will have the total number of GPUs configured.
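To get the free GPUs per node, as asked in the question, a rough sketch combining both fields (free = configured minus allocated; this assumes the standard CfgTRES/AllocTRES strings printed by scontrol show node, which may vary slightly between sites) could look like:

scontrol -o show node | awk '{
    name = ""; alloc = 0; cfg = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^NodeName=/) name = substr($i, 10)
        if ($i ~ /^CfgTRES=/   && match($i, /gpu=[0-9]+/)) cfg   = substr($i, RSTART + 4, RLENGTH - 4)
        if ($i ~ /^AllocTRES=/ && match($i, /gpu=[0-9]+/)) alloc = substr($i, RSTART + 4, RLENGTH - 4)
    }
    # only print nodes that actually have GPUs configured
    if (cfg + 0 > 0) printf "%s: %d free of %d GPUs\n", name, cfg - alloc, cfg
}'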

Upvotes: 2
