Roman

Reputation: 425

Correct place to use cudaSetDeviceFlags?

Win10 x64, CUDA 8.0, VS2015, 6-core CPU (12 logical cores), 2 GTX580 GPUs.

In general, I'm working on a multithreaded application that launches 2 threads that are associated with 2 GPUs available, these threads are stored in a thread pool.

Each thread does the following initialization procedure upon its launch (i.e., this is done only once during the lifetime of each thread):

::cudaSetDevice(0 /* or 1, as we have only two GPUs */);
::cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
::cudaSetDeviceFlags(cudaDeviceMapHost | cudaDeviceScheduleBlockingSync);

Then, from other worker threads (12 more threads that do not touch the GPUs at all), I begin feeding these 2 GPU-associated worker threads with data. It works perfectly as long as the number of GPU threads launched equals the number of physical GPUs available.

Now I want to launch 4 GPU threads (i.e. 2 threads per GPU) and make each one work via a separate CUDA stream. I know the requirements that are essential for proper CUDA stream usage, and I meet all of them. What I'm failing on is the initialization procedure mentioned above.
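For reference, the per-thread setup I have in mind is roughly this (a sketch; the function names are mine and error checking is omitted):

// Sketch: each GPU worker thread binds to a device and creates its own
// stream, so that two threads can share one physical GPU.
void GpuThreadInit(int nDevice, cudaStream_t* pStream)
{
    ::cudaSetDevice(nDevice);        // bind this host thread to the GPU
    ::cudaStreamCreate(pStream);     // per-thread stream for overlapped work
}

void GpuThreadShutdown(cudaStream_t stream)
{
    ::cudaStreamSynchronize(stream); // drain any outstanding work
    ::cudaStreamDestroy(stream);
}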

As soon as this procedure is executed twice from different GPU threads but for the same GPU, ::cudaSetDeviceFlags(...) starts failing with the "cannot set while device is active in this process" error message.

I have looked into the manual and I think I understand why this happens; what I can't figure out is how to use ::cudaSetDeviceFlags(...) properly for my setup.

I can comment out this ::cudaSetDeviceFlags(...) line and the program will work fine even with 8 threads per GPU, but I need the cudaDeviceMapHost flag to be set in order to use streams; pinned memory won't be available otherwise.

EDIT Extra info to consider #1:

  1. If ::cudaSetDeviceFlags is called before ::cudaSetDevice, no error occurs.
  2. Each GPU thread allocates a chunk of pinned memory via the ::VirtualAlloc -> ::cudaHostRegister approach upon thread launch (this works just fine no matter how many GPU threads are launched) and deallocates it upon thread termination (via ::cudaHostUnregister -> ::VirtualFree). However, ::cudaHostUnregister fails with "pointer does not correspond to a registered memory region" for half of the threads if the number of threads per GPU is greater than 1.
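For clarity, the pinned-memory lifecycle in each GPU thread looks roughly like this (a sketch; the buffer size is illustrative and error checking is omitted):

// Per-thread pinned-memory lifecycle (Win32 + CUDA runtime sketch).
const SIZE_T nBytes = 64 * 1024 * 1024;  // illustrative size
void* p = ::VirtualAlloc(nullptr, nBytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
::cudaHostRegister(p, nBytes, cudaHostRegisterMapped);  // pin + map for the GPU

// ... the worker runs, issuing async copies to/from p ...

::cudaHostUnregister(p);        // must be passed the same base pointer that was registered
::VirtualFree(p, 0, MEM_RELEASE);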

Upvotes: 2

Views: 2287

Answers (1)

Roman

Reputation: 425

Well, the highly sophisticated method of try-this, try-that, see-what-happens, try-again finally did the trick, as always.

Here is an excerpt from the documentation for ::cudaSetDeviceFlags():

Records flags as the flags to use when initializing the current device. If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.

Consequently, in the GPU worker thread, ::cudaSetDeviceFlags() must be called before ::cudaSetDevice().

I have implemented something like the following in the GPU thread initialization code to make sure that the device flags set before selecting the device are actually applied:

// Record the flags first, then make the device current in this thread.
bse__throw_CUDAHOST_FAILED(::cudaSetDeviceFlags(nFlagsOfDesire));
bse__throw_CUDAHOST_FAILED(::cudaSetDevice(nDevice));

// Verify that the flags actually took effect on this device.
unsigned int nDeviceFlagsActual = 0;
bse__throw_CUDAHOST_FAILED(::cudaGetDeviceFlags(&nDeviceFlagsActual));
bse__throw_IF(nFlagsOfDesire != nDeviceFlagsActual);
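With the flags confirmed, the cudaDeviceMapHost flag actually pays off: a registered host buffer can be mapped into the device address space. A sketch (the names and size here are illustrative, error checking omitted):

// Sketch: with cudaDeviceMapHost in effect, obtain a device pointer
// for a registered (pinned) host buffer.
void* pHost = ::VirtualAlloc(nullptr, nBytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
::cudaHostRegister(pHost, nBytes, cudaHostRegisterMapped);

void* pDevice = nullptr;
::cudaHostGetDevicePointer(&pDevice, pHost, 0);  // only valid because cudaDeviceMapHost was set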

Also, the comment by talonmies showed the way to resolve the ::cudaHostUnregister errors.

Upvotes: 2
