soroosh.strife

Reputation: 1191

atomic operation disrupting all kernels

I am running some image processing operations on the GPU and I need the histogram of the output. I have written and tested the processing kernels, and I have also tested the histogram kernel separately on samples of the output pictures. They both work fine, but when I put all of them in one loop I get nothing.

This is my histogram kernel:

__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
    unsigned int X = blockIdx.x*blockDim.x+threadIdx.x;
    unsigned int Y = blockIdx.y*blockDim.y+threadIdx.y;

    // number of bins per channel, i.e. ceil(256/bin)
    int l = (256%bin==0)?256/bin: 256/bin+1;
    int c;

    // only count pixels that are inside the image and not masked out
    if (X+Y*W < n && mask[X+Y*W])
    {
        // first channel -> bins [0, l)
        c = color[(X+Y*W)*3]/bin;
        atomicAdd(&bucket[c], 1);

        // second channel -> bins [l, 2l)
        c = color[(X+Y*W)*3+1]/bin;
        atomicAdd(&bucket[c+l], 1);

        // third channel -> bins [2l, 3l)
        c = color[(X+Y*W)*3+2]/bin;
        atomicAdd(&bucket[c+l*2], 1);
    }
}

It updates the histogram vectors for red, green, and blue ('l' is the length of each vector). When I comment out the atomicAdds it produces the output again, but of course not the histogram. Why don't they work together?
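
For context, the kernel assumes the bucket array is zeroed before it is launched, since atomicAdd only accumulates. A simplified sketch of that host-side setup (placeholder names, not my exact code):

    int bin = 4;                                          // bin width, same meaning as in the kernel
    int l   = (256 % bin == 0) ? 256/bin : 256/bin + 1;   // bins per channel

    int* hist;                                            // device histogram buffer
    cudaMalloc((void**)&hist, 3 * l * sizeof(int));       // R, G and B bins back to back
    cudaMemset(hist, 0, 3 * l * sizeof(int));             // atomicAdd only adds, so start from zero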

Edit:

This is the loop:

    cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
    cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
    cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);

Then I copy the output to host memory and write it to a video. cuda_process and cuda_histogram are C functions that only launch their kernels with the dimGrid and dimBlock that are passed in as inputs. Also:
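
For example, cuda_histogram is essentially of this shape (a simplified sketch, not the exact code; the error check is here just for illustration):

    void cuda_histogram(int n, uchar* color, uchar* mask, int* bucket,
                        int ch, int W, int bin, dim3 dimGrid, dim3 dimBlock)
    {
        histogram<<<dimGrid, dimBlock>>>(n, color, mask, bucket, ch, W, bin);

        cudaError_t err = cudaGetLastError();             // reports launch failures
        if (err != cudaSuccess)
            printf("histogram launch failed: %s\n", cudaGetErrorString(err));
    }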

dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_histogram(16,16);
dim3 dimg_histogram(W/16,H/16);

I changed the block and grid size for the histogram kernel because it worked better that way. Does it matter?
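
(For what it is worth, I have also seen grid sizes computed with ceiling division so that the whole image is covered even when W or H is not a multiple of the block size, but I do not know if that is related to my problem:)

    // round the grid up instead of truncating, so the right/bottom edge of
    // the image still gets threads when W or H is not a multiple of 16
    dim3 dimb_histogram(16,16);
    dim3 dimg_histogram((W + dimb_histogram.x - 1) / dimb_histogram.x,
                        (H + dimb_histogram.y - 1) / dimb_histogram.y);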

Edit 2: I am using the -arch=sm_11 option for compilation. I just read about it somewhere. Could anyone tell me how I should choose it?

Upvotes: 0

Views: 212

Answers (1)

user1545642

Perhaps you should try to compile without the -arch=sm_11 flag. SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0, so there is no reason to compile for SM 1.1 unless you need backward compatibility.

One possible issue is that SM 1.1 does not support atomic operations on 64-bit integers in global memory. So I would suggest that you recompile the code without the -arch option, or use -arch=sm_20 if you like.
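
If you are not sure which architecture your card has, you can query it at runtime and pick the matching -arch flag from the reported compute capability (for example 2.0 -> -arch=sm_20). A minimal standalone sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // prints the compute capability of device 0, e.g. "2.0" -> compile with -arch=sm_20
    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        return 0;
    }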

Upvotes: 1
