Reputation: 148
I am doing some simple parallel computations on an array with CUDA C++. Everything works fine and the kernel outputs correct results (checked against serial CPU code), but my entire screen goes black for the whole time the kernel is executing. I am new to CUDA, so I am probably doing something wrong; I just can't seem to figure out what.
#define KERNEL_FOR_ITERS 1e6

__global__ void addKernel(float *c, const float *a, const float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        for (int j = 0; j < KERNEL_FOR_ITERS; j++)
            c[i] += sqrt(abs(sin(a[i] * b[i])));
    else
        for (int j = 0; j < KERNEL_FOR_ITERS; j++)
            c[i] += sqrt(abs(cos(a[i] * b[i])));
}
My question is: can I prevent my screen from going black for the entire kernel execution, without compromising too much on speed?
Upvotes: 1
Views: 756
Reputation: 152173
It's somewhat useful if you describe your setup, including the OS and the GPU you are running on, whether or not the GPU is driving a display, and, if you are on Windows, whether the GPU is in WDDM or TCC mode.
However, we can make some general statements without that.
As pointed out in the comments, at the current time, a GPU running a CUDA kernel will not service display requests if it is also supporting a display. This means the display will appear to "freeze", or possibly turn black, while a GPU kernel is running. It's certainly possible that this may change in the future, but that is current and expected behavior.
The usual suggestion in this case is to use a 2nd GPU for running CUDA if you don't want to disturb the display at all, and if you are on Windows, it's best if that GPU is capable of, and placed in, TCC mode.
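If you do go the 2nd GPU route, something like the following sketch could be used to prefer the non-display/TCC device at runtime. This is only illustrative (the pickComputeDevice helper and the selection heuristic are my own, not a standard API); it relies on the cudaDeviceProp fields tccDriver (set on Windows TCC-mode GPUs) and kernelExecTimeoutEnabled (set when a display watchdog is active):

    #include <cstdio>
    #include <cuda_runtime.h>

    int pickComputeDevice()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; d++){
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            // Prefer a GPU in TCC mode, or one without a display watchdog timeout
            if (prop.tccDriver || !prop.kernelExecTimeoutEnabled){
                printf("using device %d: %s\n", d, prop.name);
                cudaSetDevice(d);
                return d;
            }
        }
        return -1; // no dedicated compute GPU found; fall back to limiting kernel duration
    }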
To mitigate the effect while using only a single GPU, and indeed to provide CUDA support for production purposes in a single-GPU display environment, it's important that the CUDA side of the application be designed so that kernel duration is limited. For good interactivity, a reasonable starting point is to limit kernel duration to 0.1 seconds or less, as that level of loss of interactivity may not be particularly noticeable. If you or someone else disagrees with that human-factors statement, that's OK; we need not argue about it. Simply reduce the kernel duration to whatever level you decide will result in good display interactivity.
The situation is further complicated on Windows (not, to my knowledge, on Linux) by WDDM command batching. To improve performance, commands may be batched, and the batching of back-to-back kernel calls may result in longer perceived periods of lost interactivity than a single kernel call alone would suggest. There are no methods that I'm aware of to formally work around this. You may be able to "flush" the WDDM command queue by issuing a spurious (i.e. not otherwise necessary) CUDA operation such as a cudaStreamQuery() call after each kernel launch. Again, I don't know of formally documented methods for this, and to some extent it may depend on your application design.
With respect to performance, CUDA kernel launches typically involve somewhere around 100 microseconds or less of launch overhead (we can call it wasted time). Therefore, if we break up a long-running kernel into 100 millisecond "chunks", and each chunk adds ~100 microseconds of overhead, that is roughly 100 microseconds of overhead per 100 milliseconds of work, so the net effect on performance might be on the order of a 0.1% reduction in CUDA computation throughput (assuming the display tasks are trivial).
Using your supplied code as an example, you would break that kernel up into a sequence of kernel launches, benchmarking/timing on the GPU of your choice so that each kernel runs for no longer than about 100 milliseconds (or a duration of your choosing); a timing sketch follows the example below.
#define KERNEL_FOR_ITERS 1e6

__global__ void addKernel(float *c, const float *a, const float *b, const int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        for (int j = 0; j < iters; j++)
            c[i] += sqrt(abs(sin(a[i] * b[i])));
    else
        for (int j = 0; j < iters; j++)
            c[i] += sqrt(abs(cos(a[i] * b[i])));
}
...
const int loop_iters = 1e4; // chosen by tuning or benchmarking
cudaStream_t str;
cudaStreamCreate(&str);
for (int i = 0; i < KERNEL_FOR_ITERS; i += loop_iters){
    addKernel<<<...,0,str>>>(d_c, d_a, d_b, loop_iters);
    cudaStreamQuery(str); // probably unnecessary on linux
}
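For the benchmarking/timing step, CUDA events are one convenient option. The following is just a sketch (the launch configuration is elided with ... as in the example above): measure one chunk, then adjust loop_iters until the reported time is comfortably under your chosen budget.

    // Sketch: measure the duration of one chunk with CUDA events, then tune
    // loop_iters (by hand or in a tuning loop) so it stays under ~100 ms.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, str);
    addKernel<<<...,0,str>>>(d_c, d_a, d_b, loop_iters);
    cudaEventRecord(stop, str);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("one chunk of %d iterations took %f ms\n", loop_iters, ms);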
I don't imagine this is the kernel you are actually using, but as an aside, its performance characteristics can possibly be improved by restricting the things that are actually different between threads to a small section of code. For example:
__global__ void addKernel(float *c, const float *a, const float *b, const int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = a[i] * b[i];
    if (i % 2 == 0)
        val = sin(val);
    else
        val = cos(val);
    for (int j = 0; j < iters; j++)
        c[i] += sqrt(abs(val));
}
The compiler may figure out this sort of contraction anyway, but I would usually try to give it the best possible "head start".
Upvotes: 2