wstcegg

Reputation: 171

Get mapping between /dev/nvidia* and nvidia-smi gpu list

A server with 4 GPUs is used for deep learning. It often happens that GPU memory is not freed after the training process is terminated (killed). The results shown by nvidia-smi are:

[Screenshot: nvidia-smi results]

CUDA device 2 is in use (possibly by a process launched with CUDA_VISIBLE_DEVICES=2).

Some sub-processes are still alive and thus occupy GPU memory.

  1. One brute-force solution is to kill all processes created by python using:

    pkill -u user_name python

This helps if there is only one process to be cleaned up, since pkill kills every Python process owned by the user.

  2. Another solution, proposed in the official PyTorch forum thread "My GPU memory isn't freed properly", is to find the leftover processes via:

    ps -elf | grep python

  3. However, if multiple processes are running and we only want to kill the ones related to a certain GPU, we can group processes by device file (nvidia0, nvidia1, ...) using:

    fuser -v /dev/nvidia*

[Screenshot: fuser -v results]

As we can see, /dev/nvidia3 is used by some python threads, so /dev/nvidia3 corresponds to CUDA device 2.
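
Once the offending device file has been identified, the processes holding it open can be killed directly. A minimal sketch (note that fuser -k sends SIGKILL by default and also kills unrelated processes that merely have the device open):

    # Kill every process that currently has /dev/nvidia3 open (SIGKILL by default).
    fuser -k /dev/nvidia3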

The problem is: I want to kill the processes that were launched with CUDA_VISIBLE_DEVICES=2, but I do not know which device file (/dev/nvidia0, /dev/nvidia1, ...) they correspond to.

How can I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?

Upvotes: 4

Views: 1731

Answers (1)

Martin Pecka

Reputation: 3083

If you set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the device order should be consistent between CUDA and nvidia-smi.
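
For example (a minimal sketch; train.py is a hypothetical training script):

    # By default CUDA orders devices "fastest first", which can differ from
    # nvidia-smi's PCI-bus order. Forcing PCI_BUS_ID makes index 2 refer to
    # the same physical GPU in both tools (the /dev/nvidiaN minor number
    # typically follows the same PCI order as well).
    CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2 python train.py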

There is also another option (if you are sure that limiting to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process's environment can be examined in /proc/${PID}/environ. The format is partially binary, but grepping through the output usually works (if you force grep to treat the file as a text file). This might require root privileges.
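
A minimal sketch of that approach, assuming the processes in question belong to user_name and were started with CUDA_VISIBLE_DEVICES=2 (reading other users' /proc/<pid>/environ may require sudo):

    # List python PIDs whose environment contains CUDA_VISIBLE_DEVICES=2.
    # /proc/<pid>/environ is NUL-separated, so tr makes it line-based and
    # grep -a forces the content to be treated as text.
    for pid in $(pgrep -u user_name python); do
        if tr '\0' '\n' < /proc/$pid/environ | grep -a -q '^CUDA_VISIBLE_DEVICES=2$'; then
            echo "PID $pid was launched with CUDA_VISIBLE_DEVICES=2"
        fi
    done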

Upvotes: 1
