Reputation: 1087
I am having an unknown error in my CUDA program and it seems to be related to the atomicAdd function. I am coding on Windows with Visual Studio 2015. My calling function is the following:
int regionWidth = 32;
int regionHeight = 32;
dim3 gridSize(765, 765);
dim3 blockSize(regionWidth, regionHeight);

int count = 0;
int* dev_count = nullptr;
cudaMalloc((void**)&dev_count, sizeof(int));
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);

crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count);

cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("total number of threads that executed was: %d vs. %d called -> %s\n",
       count, gridSize.x*gridSize.y*blockSize.x*blockSize.y,
       (count == gridSize.x*gridSize.y*blockSize.x*blockSize.y) ? "ok" : "error");
Then my global kernel function is:
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
__shared__ int shared_sum;
shared_sum = 0;
sumGlobal(regionWidth, regionHeight, &shared_sum);
atomicAdd(ct, 1);
}
with sumGlobal defined as
__device__
void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
{
// sum in nested loop
for (int y = 0; y < regionHeight; y++)
for (int x = 0; x < regionWidth; x++)
atomicAdd(global_sum, 1);
}
The build output is the following:
1> H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio
14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --
machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE
-D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o
x64\Release\targetDetectionGPU.cu.obj
"H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"
It's a standard NVIDIA CUDA console project; I only changed the arch to sm_50,compute_50.
My program's output is the following (with debug information):
sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA
Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version: 9.0
CUDA Runtime Version: 8.0
CUDA Capability Major revision number: 6
CUDA Capability Minor revision number: 1
Total amount of global memory: 0 bytes
Number of multiprocessors: 6
Number of cores: 288
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.39 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host
threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API
error (30): unknown error
finshed cuda algorithm
With smaller grid sizes it seems to work better, so when I instead choose a 764 x 764 grid size I get:
Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called ->
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API
error (30): unknown error
With 750 x 750 the error was gone; with 760 x 760 it was back.
The device specifications allow much larger grid sizes than 765, or am I missing something here? I am not sure why a simple atomicAdd in a nested loop should cause these errors. Is it a bug?
OK, I have now simplified the kernel: I removed the function call and combined the loops into one, but the error still occurs on larger grid sizes; if I comment out the loop it runs fine.
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
__shared__ int shared_sum;
shared_sum = 0;
__syncthreads();
for (int y = 0; y < regionHeight*regionWidth; y++)
atomicAdd(&shared_sum, 1);
__syncthreads();
atomicAdd(ct, 1);
}
If I shorten the loop to
for (int y = 0; y < regionHeight; y++)
atomicAdd(&shared_sum, 1);
then it works fine. It seems like a timeout issue, which is strange because I set the WDDM TDR timeout to 10 seconds with the Nsight Monitor.
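For comparison, a variant of the kernel that produces the same per-block total without the long atomic loop (only a sketch; the kernel name is mine and I have not timed it) would be:

__global__
void crashFN_single(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;
    shared_sum = 0;
    __syncthreads();
    // one atomicAdd per thread instead of regionWidth*regionHeight of them
    atomicAdd(&shared_sum, regionWidth * regionHeight);
    __syncthreads();
    atomicAdd(ct, 1);
}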
Upvotes: 0
Views: 494
Reputation: 1087
If you get an "error (30): unknown error", suspect a TDR timeout, especially on Windows. Basically, my test program was taking too long in the loops and causing a timeout. This is particularly the case when you are debugging with printf statements!
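To put a rough number on it (a back-of-the-envelope count from the launch above): a 765 x 765 grid of 32 x 32 blocks is 765 * 765 * 1024 = 599,270,400 threads, and with each thread looping 32 * 32 = 1024 times that is on the order of 6 * 10^11 atomicAdd operations on a single shared variable in one launch. Atomics on the same address serialize, so the kernel simply runs long enough to hit the TDR limit.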
The solution is to increase the timeout by changing the TDR setting to something more like 30 seconds; increasing this value is not a problem when you are not using the GPU for the main display. With the TDR value increased, you can see more clearly that it is your program taking too long and not something else. Try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use techniques such as reduction.
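As a sketch of the kind of restructuring I mean (the kernel name and layout are only illustrative, and it assumes the block size is a power of two, as with the 32 x 32 blocks here), a block-level reduction in shared memory replaces the per-thread atomics on a single location with a tree of additions and one global atomicAdd per block:

__global__
void countThreadsReduced(int* ct)
{
    __shared__ int partial[1024];                  // enough for a 32 x 32 block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    partial[tid] = 1;                              // each thread contributes 1
    __syncthreads();

    // tree reduction within the block (nthreads must be a power of two)
    for (int stride = nthreads / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(ct, partial[0]);                 // one global atomic per block
}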
Upvotes: 1