Reputation: 33
Background:
I have written a CUDA program that processes sequences of symbols. The program processes all sequences in parallel, with the stipulation that every sequence is the same length. To satisfy this, I sort my data into groups, each consisting entirely of sequences of the same length, and the program processes one group at a time.
Question:
I am running my code on a Linux machine with 4 GPUs and would like to utilize all 4 GPUs by running 4 instances of my program (1 per GPU). Is it possible to have the program select a GPU that isn't in use by another CUDA application to run on? I don't want to hardcode anything that would cause problems down the road when the program is run on different hardware with a greater or fewer number of GPUs.
Upvotes: 3
Views: 2324
Reputation: 465
I wrote a program that manages the CUDA_VISIBLE_DEVICES environment variable to schedule and execute multiple instances of a CUDA program on a multi-GPU machine:
https://github.com/zhou13/gpurun
Suppose you are on a 4-GPU machine and want to run 100 inference jobs in parallel, where each job needs 2 GPUs and each GPU can only run three jobs at a time to avoid GPU memory overflow. You think you need a cluster scheduler like sbatch, but for GPUs on a single machine. If this is your problem, this simple program is the solution for you!
# 1. Run infer.py (needs 1 GPU) on 100 images with all the GPUs in parallel, 1 job per GPU.
for i in $(seq 1 100); do gpurun python infer.py $i.jpg & done
# 2. Same as 1, but put 2 jobs per GPU at the same time.
for i in $(seq 1 100); do gpurun -j2 python infer.py $i.jpg & done
# 3. Same as 2, but use gnu-parallel to simplify the command.
parallel -j0 gpurun -j2 python infer.py {} ::: $(seq 1 100)
# 4. Same as 1, but infer.py now will see 2 GPUs.
parallel -j0 gpurun -g2 python infer.py {} ::: $(seq 1 100)
# 5. You can customize the GPUs to be used with --gpus.
parallel -j0 gpurun --gpus 0,1 python infer.py {} ::: $(seq 1 100)
# 6. You can customize the name of lockfile with --session.
parallel -j0 gpurun --session ml-session python infer.py {} ::: $(seq 1 100)
Upvotes: 0
Reputation: 5807
There is a better (more automatic) way, which we use in PIConGPU, a code that runs on huge (and varied) clusters. See the implementation here: https://github.com/ComputationalRadiationPhysics/picongpu/blob/909b55ee24a7dcfae8824a22b25c5aef6bd098de/src/libPMacc/include/Environment.hpp#L169
Basically: call cudaGetDeviceCount to get the number of GPUs, iterate over them, call cudaSetDevice to make each one the current device, and check whether that worked. The check should involve test-creating a stream, because of a bug in CUDA where cudaSetDevice would succeed even though all later calls failed because the device was actually in use.
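A minimal sketch of that probing loop (my own illustration, not the PIConGPU code; the helper name pickFreeDevice is made up and error handling is trimmed):

#include <cstdio>
#include <cuda_runtime.h>

// Probe devices in order and return the first one that accepts work.
// Creating a stream forces real context creation on the device, so a GPU
// that is busy in exclusive compute mode gets rejected at this point.
static int pickFreeDevice()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
        return -1;

    for (int dev = 0; dev < count; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess)
            continue;
        cudaStream_t stream;
        if (cudaStreamCreate(&stream) == cudaSuccess) {
            cudaStreamDestroy(stream);
            return dev;                 // this device is usable
        }
        // Stream creation failed (e.g. device busy in exclusive mode); try the next one.
    }
    return -1;
}

int main()
{
    int dev = pickFreeDevice();
    if (dev < 0) {
        std::fprintf(stderr, "no usable GPU found\n");
        return 1;
    }
    std::printf("running on device %d\n", dev);
    // ... process the current group of sequences on this device ...
    return 0;
}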
Note: You may need to set the GPUs to exclusive compute mode so that each GPU can only be used by one process. If a single "batch" does not have enough data to fill a GPU, you may want the opposite: multiple processes submitting work to one GPU. So tune according to your needs.
Another idea: start an MPI application with as many processes per node as there are GPUs, and use the node-local rank as the device number. This would also help in applications like yours that have different datasets to distribute, so you can e.g. have MPI rank 0 process the length1 data and MPI rank 1 process the length2 data, etc.
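A hedged sketch of that mapping (my own illustration, assuming MPI-3 so that MPI_Comm_split_type can supply the node-local rank):

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Ranks that share a node form one communicator; the rank inside it
    // is the node-local rank, which we map onto a device number.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) {
        std::fprintf(stderr, "no CUDA devices on this node\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaSetDevice(local_rank % ndev);   // wrap around if there are more ranks than GPUs

    std::printf("local rank %d -> device %d of %d\n",
                local_rank, local_rank % ndev, ndev);

    // ... e.g. rank 0 processes the length1 group, rank 1 the length2 group ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}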
Upvotes: 0
Reputation: 151799
The environment variable CUDA_VISIBLE_DEVICES is your friend.
I assume you have as many terminals open as you have GPUs. Let's say your application is called myexe. Then in one terminal, you could do:
CUDA_VISIBLE_DEVICES="0" ./myexe
In the next terminal:
CUDA_VISIBLE_DEVICES="1" ./myexe
and so on.
Then the first instance will run on the first GPU enumerated by CUDA. The second instance will run on the second GPU (only), and so on.
Assuming bash, and for a given terminal session, you can make this "permanent" by exporting the variable:
export CUDA_VISIBLE_DEVICES="2"
thereafter, all CUDA applications run in that session will observe only the third enumerated GPU (enumeration starts at 0), and they will observe that GPU as if it were device 0 in their session.
This means you don't have to make any changes to your application for this method, assuming your app uses the default GPU or GPU 0.
You can also extend this to make multiple GPUs available, for example:
export CUDA_VISIBLE_DEVICES="2,4"
means the GPUs that would ordinarily enumerate as 2 and 4 would now be the only GPUs "visible" in that session and they would enumerate as 0 and 1.
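If you want to confirm from inside the application what it will see, a small check like this (a sketch, not code from the answer) prints the visible devices; with CUDA_VISIBLE_DEVICES="2,4" it reports two devices, enumerated as 0 and 1:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("visible devices: %d\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // pciBusID identifies the physical GPU behind the remapped device number.
        std::printf("device %d: %s (PCI bus %d)\n", d, prop.name, prop.pciBusID);
    }
    return 0;
}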
In my opinion the above approach is the easiest. Automatically selecting a GPU that "isn't in use" is problematic: you need a workable definition of "in use", and two instances that check at the same moment can race and pick the same GPU. So the best advice (IMO) is to manage the GPUs explicitly. Otherwise you need some form of job scheduler (outside the scope of this question, IMO) that can query unused GPUs and "reserve" one before another app tries to do so, in an orderly fashion.
Upvotes: 5