How to calculate histogram using shared memory

Question

I want to calculate a histogram using shared memory and wrote the code below, but the CUDA result differs from the CPU result: the CUDA result is blockDim.x times the CPU result. Why?
And how can I fix it? Thank you!

__global__ void CalHistKernel(int*imgData,int*bins,int datasize)
{
    __shared__ int _bins[3];

    int tx=threadIdx.x;
    int idx=blockIdx.x*blockDim.x+threadIdx.x;//blockDim.y=1
    if(tx<3)
    {
        _bins[tx]=0;       
    }
    __syncthreads();

    if(idx

Roger Dahl · Accepted Answer

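Looks like it's due to the final loop, where you add the values in each shared memory bin to the global memory bins. Remember that the loop is executed in each thread, so every one of the blockDim.x threads in a block adds that block's shared counts to the global bins. That is why the CUDA result is blockDim.x times the CPU result.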
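One way to avoid the repeated add is to let only one thread per bin do the final merge. The following is a minimal sketch of that idea, not the original poster's full kernel (the question's code is cut off). The names CalHistKernelSketch, kNumBins and histogramOnGpu are made up for illustration, and it assumes that each input value is already a bin index in [0, 3), that blockDim.x is at least 3, and that error checking can be left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kNumBins = 3;  // three bins, as in the question's _bins[3]

// Hypothetical kernel: assumes each element of imgData is a bin index in
// [0, kNumBins) and that the launch uses blockDim.x >= kNumBins.
__global__ void CalHistKernelSketch(const int* imgData, int* bins, int datasize)
{
    __shared__ int _bins[kNumBins];

    int tx  = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Zero the per-block bins before anyone counts into them.
    if (tx < kNumBins) {
        _bins[tx] = 0;
    }
    __syncthreads();

    // Each thread counts one element into the block's shared bins.
    if (idx < datasize) {
        int b = imgData[idx];
        if (b >= 0 && b < kNumBins) {
            atomicAdd(&_bins[b], 1);
        }
    }
    __syncthreads();

    // Only one thread per bin merges into global memory, so each block's
    // counts are added exactly once instead of blockDim.x times.
    if (tx < kNumBins) {
        atomicAdd(&bins[tx], _bins[tx]);
    }
}

// Hypothetical host helper: copies the data to the device, runs the kernel,
// and returns the bin counts. CUDA error checking is omitted.
std::vector<int> histogramOnGpu(const std::vector<int>& data)
{
    std::vector<int> result(kNumBins, 0);
    if (data.empty()) {
        return result;
    }

    const int n = static_cast<int>(data.size());
    int* dData = nullptr;
    int* dBins = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dBins, kNumBins * sizeof(int));
    cudaMemcpy(dData, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, kNumBins * sizeof(int));  // global bins must start at zero

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    CalHistKernelSketch<<<blocks, threads>>>(dData, dBins, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dBins, kNumBins * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);
    return result;
}
```

The global add uses atomicAdd because several blocks may update the same global bin at the same time. The shared-memory counting uses atomicAdd for the same reason within a block.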
