username_4567
username_4567

Reputation: 4923

Strange behaviour of CUDA kernel

I'm writing code for calculating prefix sum. Here is my kernel

__global__ void prescan(int *indata,int *outdata,int n,long int *sums)  
{  
    extern __shared__ int temp[];  

    int tid=threadIdx.x;
    int offset=1,start_id,end_id;
    int *global_sum=&temp[n+2];

    if(tid==0)
    {
        temp[n]=blockDim.x*blockIdx.x;
        temp[n+1]=blockDim.x*(blockIdx.x+1)-1;
        start_id=temp[n];
        end_id=temp[n+1];
        //cuPrintf("Value of start %d and end %d\n",start_id,end_id);

    }
    __syncthreads();
    start_id=temp[n];
    end_id=temp[n+1];
    temp[tid]=indata[start_id+tid];
    temp[tid+1]=indata[start_id+tid+1];


    for(int d=n>>1;d>0;d>>=1)
    {
        __syncthreads();
        if(tid<d)
        {
            int ai=offset*(2*tid+1)-1;
            int bi=offset*(2*tid+2)-1;

            temp[bi]+=temp[ai];
        }
        offset*=2;
    }

    if(tid==0)
    {  
        sums[blockIdx.x]=temp[n-1];  
        temp[n-1]=0;
        cuPrintf("sums %d\n",sums[blockIdx.x]);
    }
    for(int d=1;d<n;d*=2)
    {
        offset>>=1;
        __syncthreads();
        if(tid<d)
        {
            int ai=offset*(2*tid+1)-1;
            int bi=offset*(2*tid+2)-1;
            int t=temp[ai];
            temp[ai]=temp[bi];
            temp[bi]+=t;
        }
    }

    __syncthreads();

    if(tid==0)
    {
        outdata[start_id]=0;
    }

    __threadfence_block();
    __syncthreads();
    outdata[start_id+tid]=temp[tid];
    outdata[start_id+tid+1]=temp[tid+1];
    __syncthreads();

    if(tid==0)  
    {  
        temp[0]=0;  
        outdata[start_id]=0;  

    }  

    __threadfence_block();
    __syncthreads();

    if(blockIdx.x==0 && threadIdx.x==0)
    {
        for(int i=1;i<gridDim.x;i++)
        {
            sums[i]=sums[i]+sums[i-1];
        }
    }

    __syncthreads();
    __threadfence();

    if(blockIdx.x==0 && threadIdx.x==0)
    {
        for(int i=0;i<gridDim.x;i++)
        {
            cuPrintf("****sums[%d]=%d ",i,sums[i]);
        }
    }


    __syncthreads();
    __threadfence();


    if(blockIdx.x!=gridDim.x-1)
    {
        int tid=(blockIdx.x+1)*blockDim.x+threadIdx.x;
        if(threadIdx.x==0)
            cuPrintf("Adding %d \n",sums[blockIdx.x]);
        outdata[tid]+=sums[blockIdx.x];

    }
    __syncthreads();

}

In above kernel, sums array will accumulate prefix sum per block and and then first thread will calculate prefix sum of this sum array. Now if I print this sum array from device side it'll show correct results while in

cuPrintf("Adding %d \n",sums[blockIdx.x]);

this line it prints that it is taking old value. What could be the reason?

Upvotes: 1

Views: 441

Answers (1)

harrism
harrism

Reputation: 27899

Your code is not a valid implementation of multi-block prefix sum. You are trying to use a single thread of block 0 to compute the prefix sum of the partial block sums, before those partial block sums are guaranteed to have been written to memory. __syncthreads() only synchronizes threads in a single block, not across blocks. So in this code:

__threadfence_block();
__syncthreads();

if(blockIdx.x==0 && threadIdx.x==0)
{
    for(int i=1;i<gridDim.x;i++)
    {
        sums[i]=sums[i]+sums[i-1];
    }
}

All blocks are not guaranteed to have computed their sums[blockIdx.x] before block 0 executes this code. In fact, if you launch more blocks than can run concurrently on the device, all blocks are not guaranteed to have even started when you reach this code.

To make this code correct, you need to end the kernel before this code, and launch a new kernel to compute the block prefix sum result, and another one to add the result of that to each thread block.

Upvotes: 2

Related Questions