einpoklum

Reputation: 131546

How do I mitigate CUDA's very long initialization delay?

Initializing CUDA in a newly-created process can take quite some time: half a second or more on many of today's server-grade machines. As @RobertCrovella explains, CUDA initialization usually includes establishing the Unified Memory model, which involves harmonizing the device and host memory maps. This can take quite a long time on machines with a lot of memory, and there may be other factors contributing to the delay as well.
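For reference, the delay is easy to observe by timing the very first CUDA call a fresh process makes. A minimal sketch (the `cudaFree(0)` idiom merely forces the lazy context creation; actual timings will vary by machine):

```cpp
// Time the first CUDA runtime call in a fresh process.
// cudaFree(0) is a common idiom for forcing lazy context initialization.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);   // triggers CUDA initialization on first use
    auto end = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("First CUDA call: %s, took %.1f ms\n", cudaGetErrorString(err), ms);
    return 0;
}
```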

This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes which don't use complicated virtual memory mappings: each of them has to sit through its own long wait, despite the fact that, essentially, they could just re-use whatever initialization CUDA performed the last time (perhaps with a bit of cleanup code).

Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process - that would save you those long initialization costs. But isn't there a simpler approach? What about:

Upvotes: 2

Views: 1550

Answers (1)

talonmies

Reputation: 72348

What you are asking about already exists. It is called MPS (Multi-Process Service), and it basically keeps a single GPU context alive at all times with a daemon process that emulates the driver API. The initial target application is MPI, but it does basically what you envisage.

Read more here:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf
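For completeness: per the overview document above, using MPS for a batch of processes amounts, roughly, to starting the control daemon before the batch and shutting it down afterwards (device selection, pipe/log directories and pre-Volta exclusive-mode setup omitted; see the documentation for the full procedure):

```sh
# Start the MPS control daemon; it spawns the shared server on the first client connection.
nvidia-cuda-mps-control -d

# ... run your sequence of CUDA-using processes here; they attach to the
#     shared server's GPU context instead of each initializing its own ...

# Shut down the daemon (and the server) when the batch is done.
echo quit | nvidia-cuda-mps-control
```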

Upvotes: 1
