CUDA: illegal memory access was encountered with atomicAdd

The atomic operation in my program works correctly as long as I don't increase the grid size or call the kernel again. How can this be? Perhaps shared memory isn't automatically freed?

  __global__ void DevTest() {
      __shared__ int* k1;
      k1 = new int(0);
      atomicAdd(k1, 1);
  }

  int main()
  {
      for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
  }

Upvotes: 1

Views: 537

Answers (1)

Robert Crovella

Reputation: 152143

This:

 __shared__  int* k1;

creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.

This:

k1 = new int(0);

sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited to 8MB by default. Furthermore, there is an allocation granularity, so a single int allocation may consume more than 4 bytes of device heap space (in practice, it will).

It's generally good practice in C++ to pair every new with a corresponding delete. Your kernel code does not do this, so as you increase the grid size you consume more and more device heap memory, and you will eventually run into the 8MB limit.

So there are at least 2 options to fix this:

  1. delete the allocation created with new at the end of the thread code
  2. increase the limit on the device heap; instructions are linked in the documentation above
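For item 2, a minimal host-side sketch of raising the device heap limit with `cudaDeviceSetLimit` (the 64MB value here is an arbitrary choice for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // The heap limit must be raised before launching any kernel that
    // uses in-kernel new/malloc; it cannot be changed after such a
    // kernel has run. 64MB here is just an illustrative value.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    // Read the limit back to confirm the setting took effect.
    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device heap size: %zu bytes\n", heapSize);
    return 0;
}
```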

As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation, as each executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly (49 of the 50 allocated pointer values will be lost).

So your code doesn't really make sense as written, and there is no simple way to fix it while keeping this structure. You could do something like this:

  __global__ void DevTest() {
      __shared__ int* k1;
      if (threadIdx.x == 0) k1 = new int(0);   // one allocation per block
      __syncthreads();                         // make k1 visible to all threads
      atomicAdd(k1, 1);
      __syncthreads();                         // ensure all atomics are done
      if (threadIdx.x == 0) delete k1;
  }

Even such an arrangement could theoretically (perhaps on some future large GPU) run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, taking into account the unspecified allocation granularity). So the correct approach is probably to do both 1 and 2.
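Putting both fixes together, a sketch of the corrected program might look like this (the heap size is an arbitrary example, and the final error check simply reports whatever the launches produced):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0);  // fix 1: one allocation per block...
    __syncthreads();
    atomicAdd(k1, 1);
    __syncthreads();
    if (threadIdx.x == 0) delete k1;        // ...paired with a delete
}

int main() {
    // Fix 2: enlarge the device heap before any kernel uses it
    // (64MB is an arbitrary example value).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();

    // Synchronize so any in-kernel failure is surfaced here.
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```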

Upvotes: 2
