Reputation: 141
I'm trying to use two processes to speed up a script that runs on a sequence of images (each image is its own optimization problem).
I'm using torch.multiprocessing to spawn two processes. Each process initializes its tensors, models, and optimizers on a different GPU:
import numpy as np
import torch.multiprocessing as mp

if __name__ == '__main__':
    num_processes = 2
    processes = []
    img_list = [...]
    img_indices = np.arange(0, len(img_list))
    for gpu_idx in range(num_processes):
        subindices = img_indices[gpu_idx::num_processes]
        p = mp.Process(target=my_single_gpu_optimization_func,
                       args=(img_list, subindices, gpu_idx))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
Inside my_single_gpu_optimization_func, I define the target device as:
device = f'cuda:{gpu}'
model = MyModel(device=device)
The idea is that each GPU processes half of the images.
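For reference, the worker looks roughly like this (a simplified sketch; the per-image optimization itself is omitted and the image-loading line is just illustrative):

    import torch

    def my_single_gpu_optimization_func(img_list, subindices, gpu):
        # Everything in this process is supposed to live on one GPU.
        device = f'cuda:{gpu}'
        model = MyModel(device=device)  # MyModel allocates its parameters on `device`
        for i in subindices:
            # load the image onto `device` (illustrative; the real loading code differs)
            img_tensor = torch.as_tensor(img_list[i], device=device)
            # ... per-image optimization runs here ...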
When I run this, I expect to see both GPUs loaded, but in practice the memory usage on the first GPU doubles compared to the single-GPU case, the runtime halves, and the second GPU appears to be idle.
Why am I unable to utilize both GPUs and double my throughput?
Upvotes: 0
Views: 298
Reputation: 141
What seems to work is to set CUDA_VISIBLE_DEVICES inside each process/thread function, so:
import os

# Must be set before the first CUDA call in this process.
os.environ['CUDA_VISIBLE_DEVICES'] = f'{gpu}'
# The only visible GPU is now enumerated as cuda:0 inside this process.
my_model.to('cuda:0')
This seems very crude. I might as well just run two instances of my code from the command line this way. Is there a cleaner way of doing this without setting environment variables?
(BTW, I'm not sure overriding environment variables would work with threads; I really do need to fork a separate process for this solution to work.)
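For completeness, here's roughly how the workaround sits at the top of the worker (a sketch only; the per-image loop is unchanged from the question, and MyModel is the placeholder name from there):

    import os
    import torch

    def my_single_gpu_optimization_func(img_list, subindices, gpu):
        # Hide every GPU except the chosen one; this has to happen before
        # the first CUDA call in this process.
        os.environ['CUDA_VISIBLE_DEVICES'] = f'{gpu}'
        # The single visible GPU is always enumerated as cuda:0.
        device = 'cuda:0'
        model = MyModel(device=device)
        # ... same per-image loop over subindices as before ...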
Upvotes: 1