Reputation: 176
Hello, I want to find the sum of the elements of an array using CUDA.
__global__ void countZeros(int *d_A, int *B)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    B[0] = B[0] + d_A[index];
}
So in the end, B[0] is supposed to contain the sum of all elements. But I noticed that B[0] seems to be reset to zero every time, so in the end it contains only the last element. Why does B[0] become zero every time?
Upvotes: 1
Views: 3928
Reputation: 152269
All of the threads are writing to B[0], and some may be attempting to write simultaneously. This line of code:
B[0] = B[0]+d_A[index];
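requires a read and then a write of B[0]. If multiple threads do this at the same time, they can all read the same old value and then overwrite each other's updates, so additions are lost and you will get strange results.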
You can make a simple fix by doing this:
atomicAdd(B, d_A[index]);
and you should get sensible results (assuming there are no errors elsewhere in your code, which you haven't shown). Be sure to initialize B[0] to some known value (zero, for a sum) before calling this kernel.
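To make the fix above concrete, here is a minimal sketch of the atomic version together with the initialization step. The names sumAtomic and sumWithAtomics, the element count n, the bounds check, and the block size of 256 are my own assumptions for illustration; none of them come from the question's code. Error checking is left out to keep it short, and n is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>

// Each thread adds one element into the single accumulator *d_sum.
// The n parameter and the bounds check are additions that the original
// kernel did not have; they guard threads past the end of the array.
__global__ void sumAtomic(const int *d_A, int *d_sum, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        atomicAdd(d_sum, d_A[index]);  // read-modify-write as one indivisible step
    }
}

// Hypothetical host helper: copies h_A to the device, zeroes the
// accumulator, launches the kernel, and returns the sum. Assumes n > 0.
int sumWithAtomics(const int *h_A, int n)
{
    int *d_A = nullptr;
    int *d_sum = nullptr;
    int h_sum = 0;

    cudaMalloc(&d_A, n * sizeof(int));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_A, h_A, n * sizeof(int), cudaMemcpyHostToDevice);

    // Initialize the accumulator to a known value before the kernel runs.
    cudaMemset(d_sum, 0, sizeof(int));

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up to cover n
    sumAtomic<<<blocks, threads>>>(d_A, d_sum, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_sum);
    return h_sum;
}
```

Every thread still contends for the same address, so this is correct but not fast.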
If you want to do this efficiently, however, you should study the CUDA parallel reduction sample, or just use CUB.
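As a rough idea of what a reduction looks like, here is a sketch that sums within each block in shared memory and then issues one atomicAdd per block instead of one per element. It is written in the spirit of the reduction sample but is not that sample's code, and it is not tuned. The names sumBlockReduce, sumWithBlockReduce, kThreads, and the parameter n are assumptions of mine, and the block size is assumed to be a power of two.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces its slice of d_A in shared memory, then thread 0
// adds the block's partial sum into *d_sum with a single atomicAdd.
__global__ void sumBlockReduce(const int *d_A, int *d_sum, int n)
{
    __shared__ int partial[kThreads];

    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the array contribute zero.
    partial[tid] = (index < n) ? d_A[index] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(d_sum, partial[0]);
    }
}

// Hypothetical host helper, same shape as a plain atomic version.
// Assumes n > 0; error checking omitted for brevity.
int sumWithBlockReduce(const int *h_A, int n)
{
    int *d_A = nullptr;
    int *d_sum = nullptr;
    int h_sum = 0;

    cudaMalloc(&d_A, n * sizeof(int));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_A, h_A, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(int));

    const int blocks = (n + kThreads - 1) / kThreads;
    sumBlockReduce<<<blocks, kThreads>>>(d_A, d_sum, n);

    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_sum);
    return h_sum;
}
```

With CUB, the whole thing collapses to a call to cub::DeviceReduce::Sum, which is usually the better choice in real code.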
And be sure to use proper CUDA error checking any time you are having trouble with CUDA code.
So, if you still can't get sensible results, please instrument your code with proper CUDA error checking before asking "I made this change but it still doesn't work, why?" I can't tell you why, because this is the only snippet of code that you've shown.
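For reference, one common way to do that error checking is to wrap every runtime call in a macro and to check both the launch and the execution of each kernel. This is only a sketch: CUDA_CHECK, touch, and runChecked are hypothetical names, and exiting on the first error is just one possible policy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file, line, and the CUDA error string,
// then exit, whenever a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Trivial kernel used only to demonstrate the checks.
__global__ void touch(int *d_x)
{
    *d_x = 42;
}

// Launches the kernel with every step checked and returns the value written.
int runChecked()
{
    int *d_x = nullptr;
    int h_x = 0;

    CUDA_CHECK(cudaMalloc(&d_x, sizeof(int)));

    touch<<<1, 1>>>(d_x);
    CUDA_CHECK(cudaGetLastError());       // catches launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution

    CUDA_CHECK(cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    return h_x;
}
```

An out-of-bounds index in a kernel like the one in the question would typically surface at the cudaDeviceSynchronize check rather than silently producing a wrong sum.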
Upvotes: 4