Reputation: 31
I've recently begun learning CUDA, and I've stumbled upon a very strange behavior I can't understand.
My code essentially computes an average execution time for a simple atomicAdd kernel. To accomplish this, I call the kernel in a loop to get a better average. I include the device memory allocation and copies in the loop as I want to include this in my execution time estimate. The problem is, the program often fails with Runtime API error 30 if the number of runs through the loop is too high.
I suspected that I might have an issue with my memory access, so I ran memcheck on the program, to no avail: apparently there are no memory errors. Also, if I run the kernel only a few times, there are no issues, which would also seem to indicate the kernel itself isn't the problem. It's only when I call it many times in succession that I have trouble.
A skeleton of my code follows:
for (int i = 0; i < runs; i++)
{
    //////////////////////////////////
    // Copy memory from Host to Device
    //////////////////////////////////
    cutilSafeCallNoSync( cudaMemcpy(dev_waveforms, waveforms,
                                    num_wf * wf_length * sizeof(float),
                                    cudaMemcpyHostToDevice) );
    cutilSafeCallNoSync( cudaMemcpy(dev_delays, delays,
                                    num_wf * sizeof(int),
                                    cudaMemcpyHostToDevice) );

    ////////////////////////
    // Kernel Call
    ////////////////////////
    kernel_wrapper<float>(dev_waveforms, dev_focused, dev_delays,
                          wf_length, num_wf, threads, blocks, kernel);

    // Copy back to host memory.
    cutilSafeCallNoSync( cudaMemcpy(focused, dev_focused,
                                    J * wf_length * sizeof(float),
                                    cudaMemcpyDeviceToHost) );
}
Again, this only fails if runs is sufficiently large. There are other strange things going on, but I'll leave it at this for now.
Oh, I'm developing on Windows 7 using Visual Studio 2010. My GPU is also acting as my video card, and I'm worried this may have strange effects.
Thanks in advance!
Upvotes: 3
Views: 5737
Reputation: 83
For anybody else coming to this post looking for an answer to why you are getting an error 30 message:

You will also get this error if you accidentally pass a host (CPU) variable as one of the arguments to your kernel. This is the most common cause of this problem for me. You would think that after accidentally passing the host copy of a variable so many times you would learn, but...

Make sure all of the arguments to your kernel launch, myDeviceFunction<<<1,N>>>(argument1, argument2, argument3), are device variables (i.e., the pointers you allocated with cudaMalloc and filled with cudaMemcpy).
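A minimal sketch of that mistake, with made-up names (h_data, d_data, myKernel) purely for illustration:

```cuda
float h_data[256];                              // host array
float *d_data = NULL;                           // device pointer
cudaMalloc(&d_data, 256 * sizeof(float));
cudaMemcpy(d_data, h_data, 256 * sizeof(float), cudaMemcpyHostToDevice);

myKernel<<<1, 256>>>(h_data);  // WRONG: host pointer dereferenced on the GPU
myKernel<<<1, 256>>>(d_data);  // right: device pointer from cudaMalloc
```

The wrong launch often doesn't fail until the kernel actually dereferences the pointer, which is why the error can surface as a vague "unknown error" on a later API call rather than at launch time.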
Upvotes: 1
Reputation: 99
I ran into the same error and found that my kernel was actually overrunning the memory I had allocated. Since you doubled your buffers and saw the problem go away, I expect you may be experiencing the same issue.

My issue was a bug in the math that determined how many threads and blocks to launch: I was launching eight times as many blocks as I intended. Inside the kernel, the math that determined which element a given thread should work on then accessed memory well outside my array.

Make sure you check which element(s) of the array each thread works with, and guard against threads that would access or modify memory outside the array.
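A minimal sketch of that guard, assuming a 1-D launch over an array of length n (the kernel name and parameters here are illustrative, not from the original code):

```cuda
__global__ void process(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // without this check, surplus threads in the
        data[idx] *= 2.0f;    // last block (or extra blocks) write past the array
}
```

The check is needed whenever the total thread count is rounded up to a multiple of the block size, which is the usual case.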
Upvotes: 1
Reputation: 27809
The Windows 7 driver may batch up multiple commands into a single submission to get around the increased driver overhead of the WDDM (compared to pre-WDDM drivers, e.g. on Windows XP). For this reason, even if a single kernel does not exceed the watchdog timeout, a batch of launches from a loop like this might. You could call cudaDeviceSynchronize() as @RogerDahl suggests to try to work around it (possibly only every N iterations). Or run on Linux.
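A sketch of that workaround applied to the questioner's loop; the interval of 16 is arbitrary:

```cuda
for (int i = 0; i < runs; i++)
{
    // ... host-to-device copies, kernel launch, device-to-host copy as before ...

    if (i % 16 == 0)              // periodically flush the WDDM command batch
        cudaDeviceSynchronize();  // so no single submission runs long enough
}                                 // to trip the display watchdog
```

Synchronizing every iteration would also work but defeats the batching entirely, which distorts a timing measurement like this one.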
Edit:
Runtime error 30 is cudaErrorUnknown. If this were a watchdog timer timeout, I would expect a cudaErrorLaunchTimeout (error 6). Since you didn't provide the full code, it's hard to say what is causing the error; I suspect there is a bug in your kernel code.
Upvotes: 2