Reputation: 13
I want to calculate a histogram using shared memory, and I wrote the kernel below. However, the CUDA result differs from the CPU result: the CUDA result is blockDim.x times the CPU result. Why?
How can I fix it?
Thank you!
__global__ void CalHistKernel(int *imgData, int *bins, int datasize)
{
    __shared__ int _bins[3];
    int tx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x; //blockDim.y=1
    if (tx < 3)
    {
        _bins[tx] = 0;
    }
    __syncthreads();
    if (idx < datasize)
    {
        atomicAdd((int*)&_bins[imgData[idx]], 1);
    }
    __syncthreads();
    for (int i = 0; i < 3; i++)
        atomicAdd((int*)&bins[i], _bins[i]);
}
Upvotes: 1
Views: 1750
Reputation: 15734
Looks like it's due to the final loop, where you add the values in each shared memory bin to the global memory bins. Remember that the loop is executed by every thread, so each block's counts are added to the global bins blockDim.x times instead of once.
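One possible fix, as a minimal sketch rather than the only way to do it: let just the first 3 threads of each block write to global memory, one bin each, instead of having every thread run the loop. The kernel below keeps the names from the question; the `NUM_BINS` macro, the host-side `main`, the test data (`i % NUM_BINS`), the input size and the launch configuration are assumptions added only to make the example self-contained. It also assumes blockDim.x is at least 3 and that every value in `imgData` is 0, 1 or 2.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 3  // same bin count as in the question

__global__ void CalHistKernel(const int *imgData, int *bins, int datasize)
{
    __shared__ int _bins[NUM_BINS];
    int tx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Zero the per-block histogram (assumes blockDim.x >= NUM_BINS).
    if (tx < NUM_BINS)
        _bins[tx] = 0;
    __syncthreads();

    // Each thread counts one element into the block's shared histogram.
    if (idx < datasize)
        atomicAdd(&_bins[imgData[idx]], 1);
    __syncthreads();

    // Only NUM_BINS threads per block flush to global memory, one bin each,
    // so each block's counts are added exactly once.
    if (tx < NUM_BINS)
        atomicAdd(&bins[tx], _bins[tx]);
}

int main()
{
    const int n = 1000;  // hypothetical input size
    std::vector<int> h_data(n);
    int h_ref[NUM_BINS] = {0, 0, 0};
    for (int i = 0; i < n; ++i) {
        h_data[i] = i % NUM_BINS;  // hypothetical test data in [0, NUM_BINS)
        ++h_ref[h_data[i]];        // CPU reference histogram
    }

    int *d_data = nullptr, *d_bins = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_bins, NUM_BINS * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(int));  // global bins must start at zero

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    CalHistKernel<<<blocks, threads>>>(d_data, d_bins, n);

    int h_bins[NUM_BINS];
    cudaMemcpy(h_bins, d_bins, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);

    // Compare the GPU histogram against the CPU reference.
    bool ok = true;
    for (int i = 0; i < NUM_BINS; ++i) {
        printf("bin %d: gpu=%d cpu=%d\n", i, h_bins[i], h_ref[i]);
        if (h_bins[i] != h_ref[i])
            ok = false;
    }

    cudaFree(d_data);
    cudaFree(d_bins);
    return ok ? 0 : 1;
}
```

Guarding with `if (tx == 0)` around the original loop should work as well; spreading the writes over 3 threads just lets the global atomics be issued in parallel.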
Upvotes: 3