Reputation: 13
The atomic operation in my program works correctly as long as I don't increase the grid size or call the kernel again. How can this be? Perhaps shared memory isn't automatically freed?
__global__ void DevTest() {
    __shared__ int* k1;
    k1 = new int(0);
    atomicAdd(k1, 1);
}
int main()
{
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
}
Upvotes: 1
Views: 537
Reputation: 152143
This:
__shared__ int* k1;
creates storage in shared memory for a pointer. The pointer is uninitialized there; it doesn't point to anything.
This:
k1 = new int(0);
sets that pointer to point to a location on the device heap, not in shared memory. The device heap is limited by default to 8MB. Furthermore, there is an allocation granularity, such that a single int allocation will use up more than 4 bytes of device heap space.
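You can confirm the default heap size at runtime; a minimal sketch using the CUDA runtime API's cudaDeviceGetLimit:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t heapSize = 0;
    // Query the current device-heap limit (8MB by default).
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device heap: %zu bytes\n", heapSize);
    return 0;
}
```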
It's generally good practice in C++ to have a corresponding delete for every new operation. Your kernel code does not have this, so as you increase the grid size, you will use up more and more of the device heap memory. You will eventually run into the 8MB limit.
So there are at least 2 options to fix this:

1. delete the allocation created with new at the end of the thread code
2. increase the device heap size with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) before launching the kernel

As an aside, shared memory is shared by all the threads in the threadblock. So for your kernel launch of <<<50,50>>> you have 50 threads in each threadblock. Each of those 50 threads will see the same k1 pointer, and each will try to set it to a separate location/allocation as it executes the new operation. This doesn't make any sense, and it will prevent item 1 above from working correctly (49 of the 50 allocated pointer values will be lost).
So your code doesn't really make any sense. What you are showing here is not a sensible thing to do, and there is no simple way to fix it. You could do something like this:
__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0);
    __syncthreads();
    atomicAdd(k1, 1);
    __syncthreads();
    if (threadIdx.x == 0) delete k1;
}
Even such an arrangement could theoretically (perhaps, on some future large GPU) eventually run into the 8MB limit if you launched enough blocks (i.e. if there were enough resident blocks, and taking into account an undefined allocation granularity). So the correct approach is probably to do both 1 and 2.
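Putting both 1 and 2 together, a minimal host-side sketch might look like this (the 64MB heap size is an arbitrary illustrative choice, not a recommendation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void DevTest() {
    __shared__ int* k1;
    if (threadIdx.x == 0) k1 = new int(0);  // one allocation per block
    __syncthreads();
    atomicAdd(k1, 1);
    __syncthreads();
    if (threadIdx.x == 0) delete k1;        // fix 1: free every allocation
}

int main() {
    // fix 2: enlarge the device heap beyond the 8MB default
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    for (int i = 0; i < 100; i++) DevTest<<<50, 50>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```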
Upvotes: 2