Reputation: 11
My school has a GPU computing cluster with 8 GPUs per node, and we use the SLURM workload manager to manage tasks. SLURM guarantees that once a task is running on a GPU, no new tasks will be assigned to that GPU.
For example: node1 has 8 TITAN Xp GPUs. If no one else has submitted a task, I can use all 8 of them with simple C++/CUDA code like this:
for (int i = 0; i < 8; i++) {
    cudaSetDevice(i);
    ......
}
But in most cases someone has already submitted a task using one or two GPUs, for example a task running on the second GPU. If I then submit my task with the simple code above, it generates an error:
CUDA error at optStream.cu:496 code=10(cudaErrorInvalidDevice) "cudaSetDevice(coreID)"
I don't know how to handle this situation. I don't want to check which GPUs are idle and recompile the program each time; that would be too inefficient.
So I need some advice.
Upvotes: 1
Views: 468
Reputation: 4926
SLURM should be correctly setting the CUDA_VISIBLE_DEVICES
environment variable to the IDs of the GPUs allocated to your job (hint: echo this variable in your job script; if it's not being set, that must be fixed).
In your code, "use all the available GPUs" does not mean all the physically installed GPUs, but only the ones listed in that environment variable.
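To check that the variable is actually being set, a minimal job-script sketch (the `--gres` syntax and the executable name are assumptions; adapt them to your cluster):

```shell
#!/bin/bash
#SBATCH --gres=gpu:2    # request 2 GPUs; exact syntax depends on cluster configuration

# Print the GPUs SLURM allocated to this job. If this comes out
# empty or unset, the SLURM/GPU integration must be fixed.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

./my_program            # hypothetical executable name
```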
Your code will be portable with:
int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; i++) {
    cudaSetDevice(i);
    ......
}
Example: if CUDA_VISIBLE_DEVICES=2,3,
then your code will run on physical GPUs 2 and 3, but they will appear as devices 0 and 1 inside the code.
Upvotes: 2