Recker

Reputation: 1973

CUDA kernel with atomics producing wrong answer

I am trying to understand the CUDA programming model and its features. As an exercise, I am trying to convert the following loop structure, with its function call, into an efficient CUDA kernel:

//function called from the loop below
bool gmul(int rowsize, int *Ai, int *Bj, int colsize)
{
    for (int i = 0; i < rowsize; i++)
    {
        for (int j = 0; j < colsize; j++)
        {
            if (Ai[i] == Bj[j])
            {
                return true;
            }
        }
    }
    return false;
}

//Some for loop in the main function is as follows

for (i = 0; i < q; i++)
{
    cbeg = Bjc[i];
    cend = Bjc[i+1];
    for (j = 0; j < m; j++)
    {
        beg = Aptr[j];
        end = Aptr[j+1];
        if (gmul(end - beg, Acol + beg, Bir + cbeg, cend - cbeg))
        {
            temp++;
        }
    }
    Cjc1[i+1] = temp;
}

And my kernel, with the device function it calls, is as follows.

__device__ bool mult(int colsize, int rowsize, int *Aj, int *Bi, int *val)
{
    for (int j = 0; j < rowsize; j++)
    {
        for (int k = 0; k < colsize; k++)
        {
            if (Aj[j] == Bi[k])
            {
                return true;
            }
        }
    }
    return false;
}


__global__ void kernel(int *Aptr, int *Aj, int *Bptr, int *Bi, int rows, int cols, int *count, int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    if (tid < cols)
    {
        int beg = Bptr[tid];
        int end = Bptr[tid+1];
        for (i = 0; i < rows; i++)
        {
            int cbeg = Aptr[i];
            int cend = Aptr[i+1];
            if (mult(end - beg, cend - cbeg, Aj + cbeg, Bi + beg, count))
            {
                //atomicAdd(count,1);
                //Changes made are in the next line
                atomicAdd(Cjc + tid + 1, 1);
            }
        }
        //atomicAdd(Cjc+tid+1,*count);
    }
}

What I want is that whenever __device__ mult returns true, the kernel should increment a counter for that particular thread; once the for loop (in the kernel function) ends, it should store that value into the Cjc array, and count should then be handed over to other threads for their increment operations. However, I am not getting the expected values. All I get in the Cjc array is the final count after all the threads have finished executing.

I am using a GTX 480 with CC 2.0.

Any suggestions/hints as to why I am getting wrong answers, or optimizations for this CUDA kernel, will be appreciated. Thanks in advance.

Edit: the above is solved.

Right now, I am facing an issue where, whenever I reach a size of 4000 and beyond, I get the value of all elements in the array as 0. Here is how I launch the kernel:

int numBlocks, numThreads;

numThreads = 32;
numBlocks = (q + 31) / 32;  // rounds up; gives exactly q/32 when q is a multiple of 32

findkernel<<<numBlocks, numThreads>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);

I was wondering whether I am crossing any limits on block or grid dimensions, but for CC 2.0 I believe I am launching few enough blocks and threads to stay within every limit. I still wonder why all the answers come out as 0.

Upvotes: 0

Views: 520

Answers (1)

Heatsink

Reputation: 7751

You have written parallel threads that read and write count without synchronization. The threads run concurrently in an unpredictable order, so they modify and read count in an unpredictable order, and the expression *count will produce different results depending on the exact execution interleaving.

once the for loop (in kernel function) ends, it should store the value into Cjc array and count is handed over to other threads for increment operation.

There is no synchronization, so no thread waits for another thread to finish the loop. Instead of making all threads share the same storage for count, why not give each thread a different piece of storage? Then the threads will not influence one another's result. You can run a scan kernel after this one to combine results.

Upvotes: 1
