ejectamenta

Reputation: 1087

CUDA problems with atomicAdd in nested loop on medium-sized grids (>760 by 760)

I am having an unknown error within my CUDA program and it seems to be related to the atomicAdd function. I am coding on Windows in Visual Studio 2015. My calling function is specified as follows:

int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);

cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);

crashFN << < gridSize, blockSize >> > (regionWidth, regionHeight,  dev_count);

cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);

printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");

then my global kernel function is

 __global__ 
 void crashFN(int regionWidth, int regionHeight, int* ct)
 {
     __shared__ int shared_sum;

     shared_sum = 0;

     sumGlobal(regionWidth, regionHeight, &shared_sum);

     atomicAdd(ct, 1);
}

with sumGlobal defined as

 __device__
 void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
 {
     // sum in nested loop
     for (int y = 0; y < regionHeight; y++)
         for (int x = 0; x < regionWidth; x++)
                atomicAdd(global_sum, 1);
 }

The build output from the program is the following

1>  H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU 
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
 gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
 version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 
 14.0\VC\bin\x86_amd64"  -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include"     --keep-dir x64\Release -maxrregcount=0  --
 machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE 
 -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi  /MD " -o 
 x64\Release\targetDetectionGPU.cu.obj 
 "H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"

It's a standard NVIDIA CUDA console project; I only changed the arch to sm_50,compute_50.

My program's output is the following (with debug information):

sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA

Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version:                           9.0
  CUDA Runtime Version:                          8.0
  CUDA Capability Major revision number:         6
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 0 bytes
  Number of multiprocessors:                     6
  Number of cores:                               288
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.39 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host             
  threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime 
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API 
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API 
error (30): unknown error
finshed cuda algorithm

With smaller grid sizes it seems to work better.

When I instead choose a 764 x 764 grid size I get:

Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called -> 
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API 
error (30): unknown error

With 750 x 750 the error was gone; with 760 x 760 it was back.

The device specifications allow much larger grid sizes than 765, or am I missing something here? I am not sure why a simple atomicAdd in a nested loop should cause these errors; is it a bug?

OK, I have simplified the kernel now: I removed the function call and combined the loops into one, but I still get the error on larger grid sizes. If I comment out the loop it runs fine.

__global__ 
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    shared_sum = 0;
    __syncthreads();

    for (int y = 0; y < regionHeight*regionWidth; y++)
        atomicAdd(&shared_sum, 1);

    __syncthreads();

    atomicAdd(ct, 1);
}

If I shorten the loop to

  for (int y = 0; y < regionHeight; y++)
          atomicAdd(&shared_sum, 1);

then it works fine. It seems like a timeout issue, which is strange because I set the WDDM TDR timeout to 10 seconds with the NSight Monitor.

Upvotes: 0

Views: 494

Answers (1)

ejectamenta

Reputation: 1087

If you get a "error (30): unknown error" suspect a TDR timeout, especially on Windows. Basically my test program was taking to long in the loops and causing a timeout. This is particularly the case when you are debugging using printf statements!

The solution is to increase the timeout value by changing the TDR setting to something more like 30 seconds; increasing this value is not a problem when you are not using the GPU card for the main display. With a larger TDR value you can better see that it is your program taking too long and not something else. Then try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use techniques such as reduction (see the link and the sketch below).

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
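
As a rough illustration of that last point (my own sketch, not the code from the question): in the simplified kernel above, each thread can accumulate into a private register and issue a single atomicAdd afterwards, so the loop itself contains no atomics at all:

__global__
void countFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    if (threadIdx.x == 0 && threadIdx.y == 0)
        shared_sum = 0;
    __syncthreads();

    // Accumulate in a private register: no atomics inside the loop.
    int local_sum = 0;
    for (int i = 0; i < regionWidth * regionHeight; i++)
        local_sum += 1;

    // One atomic per thread instead of regionWidth*regionHeight atomics per thread.
    atomicAdd(&shared_sum, local_sum);
    __syncthreads();

    // The global thread count in ct is updated exactly as in the original kernel.
    atomicAdd(ct, 1);
}

The per-block total still ends up in shared_sum, but the number of atomic operations per thread drops from over a thousand to two, which keeps the kernel well under the watchdog limit.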

Upvotes: 1
