DengPan

Reputation: 3

How do I safely load data from global memory into shared memory in CUDA?

My kernel:

__global__ void myKernel(float *devData, float *devVec, float *devStrFac,
                         int Natom, int vecNo) {

    extern __shared__ float sdata[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread owns one scattering vector q.
    float qx = devVec[3*idx];
    float qy = devVec[3*idx + 1];
    float qz = devVec[3*idx + 2];
    __syncthreads(); // sync_1

    float c = 0.0f, s = 0.0f;
    for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
        // Stage one atom position per thread in registers. (Note: these loads
        // index past the end of devData on the last iteration when Natom is
        // not a multiple of blockDim.x.)
        float rtx = devData[3*(iatom + threadIdx.x)];     // tag_0
        float rty = devData[3*(iatom + threadIdx.x) + 1];
        float rtz = devData[3*(iatom + threadIdx.x) + 2];
        __syncthreads(); // sync_2
        sdata[3*threadIdx.x]     = rtx;                   // tag_1
        sdata[3*threadIdx.x + 1] = rty;
        sdata[3*threadIdx.x + 2] = rtz;
        __syncthreads(); // sync_3

        int end_offset = min(blockDim.x, Natom - iatom);

        for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
            float rx = sdata[3*cur_offset];
            float ry = sdata[3*cur_offset + 1];
            float rz = sdata[3*cur_offset + 2];
            // sync_4
            float theta = rx*qx + ry*qy + rz*qz;

            theta = theta - rintf(theta);      // fractional part in [-0.5, 0.5]
            theta = theta * 2.0f * 3.1415926f; // reduce theta to [-pi, pi]

            float ct, st;
            sincosf(theta, &st, &ct);

            c += ct;
            s += st;
        }
    }

    devStrFac[idx] += c*c + s*s;
}

why "__syncthreads()" labeled sync_2 is needed? Without sync_2, sdata[] get wrong numbers and I get wrong results. Line "tag_1" use the results of line "tag_0", so in my mind sync_2 is no need. Where do I wrong? If due to disorderd instruction executing, I should put a __syncthreads() in line "sync_4"?

Upvotes: 0

Views: 535

Answers (1)

Maxim Milakov

Reputation: 221

Consider one warp of the thread block finishing the first iteration and starting the next one while the other warps are still working on the first iteration. If you don't have __syncthreads at label sync_2, this warp will write to shared memory while the others are still reading from it, which is a race condition.
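To see the two hazards in isolation, here is a minimal sketch (a hypothetical kernel, not the code from the question; it assumes the launch supplies blockDim.x * sizeof(float) bytes of dynamic shared memory and one output element per thread):

__global__ void tileSum(const float *in, float *out, int n) {
    extern __shared__ float tile[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < n; base += blockDim.x) {
        // Stage the next element in a register; a global read needs no barrier.
        float v = (base + threadIdx.x < n) ? in[base + threadIdx.x] : 0.0f;

        __syncthreads(); // role of sync_2: no warp may still be READING the
                         // previous tile when we overwrite it below
        tile[threadIdx.x] = v;
        __syncthreads(); // role of sync_3: the tile must be fully WRITTEN
                         // before any warp starts reading it below

        int lim = min((int)blockDim.x, n - base);
        for (int k = 0; k < lim; k++)
            acc += tile[k]; // reads entries written by *other* threads
    }

    out[idx] = acc;
}

Every thread in the block executes the same number of outer iterations, so the barriers inside the loop are legal; remove the first one and a fast warp can overwrite tile[] while a slow warp is still in the inner loop of the previous iteration.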

You might move the __syncthreads() at label sync_2 to the end of the outer loop for the sake of clarity.
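A sketch of that restructuring against the kernel above (same behavior; note it gives up the register staging at tag_0, which lets the global loads begin before the barrier):

for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
    sdata[3*threadIdx.x]     = devData[3*(iatom + threadIdx.x)];
    sdata[3*threadIdx.x + 1] = devData[3*(iatom + threadIdx.x) + 1];
    sdata[3*threadIdx.x + 2] = devData[3*(iatom + threadIdx.x) + 2];
    __syncthreads(); // sync_3: tile fully written before any thread reads it

    int end_offset = min(blockDim.x, Natom - iatom);
    for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
        // ... inner body unchanged ...
    }

    __syncthreads(); // formerly sync_2: all threads are done reading the tile
                     // before the next iteration overwrites it
}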

"cuda-memcheck --tool racecheck" should tell you where the problem is.

Upvotes: 3
