ejectamenta

Reputation: 1087

CUDA problems with atomicAdd in nested loop on medium-sized grids (>760 by 760)

I am having an unknown error within my CUDA program and it seems to be related to the atomicAdd function. I am coding on Windows in Visual Studio 2015. My calling function is specified as follows:

int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);

cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);

crashFN << < gridSize, blockSize >> > (regionWidth, regionHeight,  dev_count);

cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);

printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");

then my global kernel function is

 __global__ 
 void crashFN(int regionWidth, int regionHeight, int* ct)
 {
     __shared__ int shared_sum;

     shared_sum = 0;

     sumGlobal(regionWidth, regionHeight, &shared_sum);

     atomicAdd(ct, 1);
}

with sumGlobal defined as

 __device__
 void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
 {
     // sum in nested loop
     for (int y = 0; y < regionHeight; y++)
         for (int x = 0; x < regionWidth; x++)
                atomicAdd(global_sum, 1);
 }

The build output from the program is the following

1>  H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU 
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
 gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
 version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 
 14.0\VC\bin\x86_amd64"  -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include"     --keep-dir x64\Release -maxrregcount=0  --
 machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE 
 -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi  /MD " -o 
 x64\Release\targetDetectionGPU.cu.obj 
 "H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"

It's a standard NVIDIA CUDA console project; I only changed the arch to sm_50,compute_50.

My program's output is the following (with debug information):

sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA

Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version:                           9.0
  CUDA Runtime Version:                          8.0
  CUDA Capability Major revision number:         6
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 0 bytes
  Number of multiprocessors:                     6
  Number of cores:                               288
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.39 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host             
  threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime 
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API 
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API 
error (30): unknown error
finshed cuda algorithm

With smaller grid sizes it seems to work better.

When I instead choose a 764 x 764 grid size I get:

Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called -> 
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API 
error (30): unknown error

With 750 x 750 the error was gone; with 760 x 760 it was back.

The device specifications allow much larger grid sizes than 765, or am I missing something here? I am not sure why a simple atomicAdd in a nested loop should cause these errors; is it a bug?

OK, I have simplified the kernel now: I removed the function call and combined the loops into one, but I still get the error on larger grid sizes. If I comment out the loop it runs fine.

__global__ 
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    shared_sum = 0;
    __syncthreads();

    for (int y = 0; y < regionHeight*regionWidth; y++)
        atomicAdd(&shared_sum, 1);

    __syncthreads();

    atomicAdd(ct, 1);
}

If I shorten the loop to

  for (int y = 0; y < regionHeight; y++)
          atomicAdd(&shared_sum, 1);

then it works fine. It seems like a timeout issue, which is strange because I set the WDDM TDR timeout to 10 seconds with the NSight Monitor.

Upvotes: 0

Views: 494

Answers (1)

ejectamenta

Reputation: 1087

If you get a "error (30): unknown error" suspect a TDR timeout, especially on Windows. Basically my test program was taking to long in the loops and causing a timeout. This is particularly the case when you are debugging using printf statements!

The solution is to increase the timeout value by changing the TDR setting to something more like 30 seconds; increasing this value is not a problem when you are not using the GPU card for the main display. With a larger TDR value you can better see that it is your program taking too long and not something else. Then try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use techniques such as reduction (see the link and the sketch below).

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
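
As a rough illustration of that last point (my own sketch, not the code from the question): in the simplified kernel above, each thread can accumulate into a private register and issue a single atomicAdd afterwards, so the loop itself contains no atomics at all:

__global__
void countFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    if (threadIdx.x == 0 && threadIdx.y == 0)
        shared_sum = 0;
    __syncthreads();

    // Accumulate in a private register: no atomics inside the loop.
    int local_sum = 0;
    for (int i = 0; i < regionWidth * regionHeight; i++)
        local_sum += 1;

    // One atomic per thread instead of regionWidth*regionHeight atomics per thread.
    atomicAdd(&shared_sum, local_sum);
    __syncthreads();

    // The global thread count in ct is updated exactly as in the original kernel.
    atomicAdd(ct, 1);
}

The per-block total still ends up in shared_sum, but the number of atomic operations per thread drops from over a thousand to two, which keeps the kernel well under the watchdog limit.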

Upvotes: 1
