Reputation: 4765
I am wondering whether there are any advantages/drawbacks to launching a kernel on multiple GPUs via cudaLaunchCooperativeKernelMultiDevice
when no actual cooperation is happening, versus the traditional loop:
for (int i = 0; i < numDevices; ++i)
{
    cudaSetDevice(i);                      // select device i
    kernel<<<..., stream[i]>>>( ... );     // launch on a stream belonging to device i
}
cudaLaunchCooperativeKernelMultiDevice is definitely less code than a loop...
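For reference, a minimal sketch of what the multi-device launch path looks like, assuming placeholder names (kernelFunc, kernelArgs, grid, block, stream[], numDevices are mine, not part of the question):

// Hedged sketch: one cudaLaunchParams entry per device, all launching the same kernel.
cudaLaunchParams params[8];                      // assume numDevices <= 8 here
for (int i = 0; i < numDevices; ++i) {
    params[i].func      = (void*)kernelFunc;     // same __global__ function on every device
    params[i].gridDim   = grid;                  // assumed launch geometry
    params[i].blockDim  = block;
    params[i].args      = kernelArgs;            // array of pointers to the kernel's arguments
    params[i].sharedMem = 0;
    params[i].stream    = stream[i];             // per the docs, must be an explicitly created (non-default) stream
}
cudaLaunchCooperativeKernelMultiDevice(params, numDevices);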
Upvotes: 1
Views: 834
Reputation: 152173
One possible drawback is that the multi-grid cooperative launch mechanism is not supported on all multi-GPU systems, whereas the launch-in-a-loop method is.
So by using cudaLaunchCooperativeKernelMultiDevice you are restricting the set of systems on which your code can run correctly to those that have the cudaDevAttrCooperativeMultiDeviceLaunch property set.
Among the various limitations implied by this are the inability to run on systems with GPUs in WDDM mode, and the inability to run on systems where the GPUs are not all identical in compute capability. You can read about the other restrictions in the programming guide.
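If it helps, a minimal runtime check for that property might look something like the following; the loop over all devices and the fallback decision are my own framing, not part of the answer:

// Query each device for multi-device cooperative launch support.
int numDevices = 0;
cudaGetDeviceCount(&numDevices);
bool coopMultiDeviceOK = (numDevices > 0);
for (int dev = 0; dev < numDevices; ++dev) {
    int attr = 0;
    cudaDeviceGetAttribute(&attr, cudaDevAttrCooperativeMultiDeviceLaunch, dev);
    if (!attr) coopMultiDeviceOK = false;        // fall back to the launch-in-a-loop method
}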
Upvotes: 2