Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

Question

I'm facing a strange problem with Torque assignment of GPUs.

I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have a simple test script to assess GPU assignment as follows:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...

CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

... it also correctly finds one or the other GPU.

When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:

CUDA_VISIBLE_DEVICES: 0 
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"   CUDA Driver Version / Runtime Version          8.0 / 8.0   CUDA Capability Major/Minor version number:    5.2   Total amount of global memory:                 12204 MBytes (12796887040 bytes)   (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores   GPU Max Clock rate:                    1076 MHz (1.08 GHz)   Memory Clock rate:                             3505 Mhz   Memory Bus Width:                              384-bit   L2 Cache Size:                                 3145728 bytes   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers   Total amount of constant memory:               65536 bytes   Total amount of shared memory per block:       49152 bytes   Total number of registers available per block: 65536   Warp size:                                     32   Maximum number of threads per multiprocessor:  2048   Maximum number of threads per block:           1024   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)   Maximum memory pitch:            2147483647 bytes   Texture alignment:                             512 bytes   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)   Run time limit on kernels:                     No   Integrated GPU sharing Host Memory:            No   Support host page-locked memory mapping:       Yes   Alignment requirement for Surfaces:            Yes   Device has ECC support:                     Disabled   Device supports Unified Addressing (UVA):      Yes   Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0   Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS

However, if a job is already running on gpu0 (i.e. if it is assigned CUDA_VISIBLE_DEVICES=1), the job cannot find any GPUs. Output:

CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Anyone know what is going on here?

Shaun · Accepted Answer

I think I've solved my own problem, but unfortunately I tried two things at once. I don't want to go back and confirm which solved the issue. It's one of the following:

Remove the --enable-cgroups option from Torque's configure script before building.
Running these steps in the Torque install process:

make packages

sh torque-package-server-linux-x86_64.sh --install

sh torque-package-mom-linux-x86_64.sh --install

sh torque-package-clients-linux-x86_64.sh --install

For the second option, I know that these steps are properly documented in the Torque install instructions. However, I have a simple setup where I just have a single node (compute node and server are same machine). I thought that 'make install' should do everything that the package installs do for that single node, but maybe I was mistaken.

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

Answers (1)

Related Questions