Reputation: 39
I want to make several GPUs visible using
os.environ["CUDA_VISIBLE_DEVICES"] = <GPU_IDs>
The following does not work for me, perhaps because the GPUs are split into MIG partitions:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
Sample GPU IDs (nvidia-smi -L):
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-b654bde8-a9d1-d27a-91eb-00000000000a)
MIG 3g.20gb Device 0: (UUID: MIG-e4a69dad-640c-5006-b7f6-00000000000c)
MIG 1g.5gb Device 1: (UUID: MIG-df1904ce-d118-5cc6-8f05-000000000007)
MIG 1g.5gb Device 2: (UUID: MIG-1b6f718c-a2db-59d5-a83d-00000000000a)
MIG 1g.5gb Device 3: (UUID: MIG-9882d1bb-3062-5d15-b0d6-000000000009)
MIG 1g.5gb Device 4: (UUID: MIG-198d257f-725f-529c-ac47-000000000004)
Other ways I have already tried:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = 'MIG-e4a69dad-640c-5006-b7f6-00000000000c'
works but only for one GPU Id.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = 'MIG-e4a69dad-640c-5006-b7f6-00000000000c', 'MIG-df1904ce-d118-5cc6-8f05-000000000007'
TypeError: str expected, not tuple
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-1b6f718c-a2db-59d5-a83d-00000000000a, MIG-9882d1bb-3062-5d15-b0d6-000000000009, MIG-198d257f-725f-529c-ac47-000000000004"
from torch.cuda import device_count
print('Number of Devices: ', device_count())
Number of Devices: 1
This does not raise an error, but apparently gives me only one GPU.
A list of GPU IDs also doesn't work:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ["MIG-1b6f718c-a2db-59d5-a83d-00000000000a", "MIG-9882d1bb-3062-5d15-b0d6-000000000009", "MIG-198d257f-725f-529c-ac47-000000000004"]
TypeError: str expected, not list
I would be grateful for any help!
Upvotes: 4
Views: 5272
Reputation: 151879
The nature of MIG partitioning is that only one MIG "instance" will be visible to any instantiation of the CUDA runtime, which in practice means one MIG instance per process.
Therefore making 2 (or more) MIG instances available/visible still won't allow you to use them from a single process in CUDA. You can only use one.
So your statement "works but only for one GPU Id" is indicating a correct usage, and the actual limitation of MIG.
See here:
"With CUDA 11, only enumeration of a single MIG instance is supported."
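Regarding the two TypeError messages in the question: os.environ only accepts str values, so a tuple or a list is rejected before CUDA ever sees it. A minimal sketch of the syntactically valid form, assuming the UUIDs from the question (the variable name mig_ids is just for illustration):

```python
import os

# UUIDs copied from the `nvidia-smi -L` output in the question.
mig_ids = [
    "MIG-1b6f718c-a2db-59d5-a83d-00000000000a",
    "MIG-9882d1bb-3062-5d15-b0d6-000000000009",
    "MIG-198d257f-725f-529c-ac47-000000000004",
]

# os.environ values must be str, so join the IDs into one comma-separated string.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(mig_ids)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This removes the TypeError, but it does not lift the MIG limitation quoted above: the process will still enumerate only a single MIG instance.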
As an aside, it seems evident that you are not using multiprocessing. If you were, it is probably possible to use "multiple MIG GPUs", but you are still limited to one per process, so you would expose exactly one MIG instance to each process. Each process would need a separate statement like the one you have already shown:
os.environ["CUDA_VISIBLE_DEVICES"] = 'MIG-e4a69dad-640c-5006-b7f6-00000000000c'
It is outside the scope of my answer to give a complete recipe for that.
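As a rough starting point rather than a complete recipe, the one-instance-per-process idea could be sketched as follows. This uses subprocess instead of the multiprocessing module, so that each child has CUDA_VISIBLE_DEVICES set in its environment before it starts. The UUIDs are the ones from the question; WORKER_CODE and launch_one_process_per_mig are hypothetical names, and the worker only reports which device it was given, so real CUDA work would have to replace it.

```python
import os
import subprocess
import sys

# UUIDs copied from the `nvidia-smi -L` output in the question.
MIG_UUIDS = [
    "MIG-e4a69dad-640c-5006-b7f6-00000000000c",
    "MIG-df1904ce-d118-5cc6-8f05-000000000007",
]

# Hypothetical worker body: it only prints the device it was assigned.
# Real CUDA work (for example importing torch) would go here instead.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_one_process_per_mig(uuids):
    """Start one child process per MIG UUID and return what each child printed."""
    procs = []
    for uuid in uuids:
        # Copy the parent environment and expose exactly one MIG instance.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", WORKER_CODE],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # The children run concurrently; collect their output in launch order.
    return [p.communicate()[0].strip() for p in procs]


seen = launch_one_process_per_mig(MIG_UUIDS)
print(seen)
```

Each child sees only its own MIG instance, so any coordination between them (splitting data, gathering results) has to happen through ordinary inter-process communication.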
Upvotes: 4