Reputation: 3
My kernel:
__global__ void myKernel(float * devData, float * devVec, float * devStrFac,
                         int Natom, int vecNo) {
    extern __shared__ float sdata[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float qx = devVec[3*idx];
    float qy = devVec[3*idx+1];
    float qz = devVec[3*idx+2];
    __syncthreads();//sync_1
    float c = 0.0, s = 0.0;
    for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
        float rtx = devData[3*(iatom + threadIdx.x)];//tag_0
        float rty = devData[3*(iatom + threadIdx.x)+1];
        float rtz = devData[3*(iatom + threadIdx.x)+2];
        __syncthreads();//sync_2
        sdata[3*threadIdx.x]     = rtx;//tag_1
        sdata[3*threadIdx.x + 1] = rty;
        sdata[3*threadIdx.x + 2] = rtz;
        __syncthreads();//sync_3
        int end_offset = min(blockDim.x, Natom - iatom);
        for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
            float rx = sdata[3*cur_offset];
            float ry = sdata[3*cur_offset + 1];
            float rz = sdata[3*cur_offset + 2];
            //sync_4
            float theta = rx*qx + ry*qy + rz*qz;
            theta = theta - lrint(theta);
            theta = theta * 2 * 3.1415926;//reduce theta to [-pi,pi]
            float ct, st;
            sincosf(theta, &st, &ct);
            c += ct;
            s += st;
        }
    }
    devStrFac[idx] += c*c + s*s;
}
why "__syncthreads()" labeled sync_2 is needed? Without sync_2, sdata[] get wrong numbers and I get wrong results. Line "tag_1" use the results of line "tag_0", so in my mind sync_2 is no need. Where do I wrong? If due to disorderd instruction executing, I should put a __syncthreads() in line "sync_4"?
Upvotes: 0
Views: 535
Reputation: 221
Consider one warp of the thread block finishing the first iteration of the outer loop and starting the next one while the other warps are still working on the first iteration. If you don't have the __syncthreads() at label sync_2, that warp ends up writing to shared memory while the other warps are still reading from it, which is a race condition.
You might move this __syncthreads() from sync_2 to the end of the outer loop for the sake of clarity.
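For illustration, here is a minimal sketch of the outer loop with that barrier moved to the end of the iteration. It assumes the rest of the kernel (qx, qy, qz, c, s, and the launch configuration) stays exactly as in the question; the bounds guard on the staging loads is an addition that the original code does not have.
for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
    // Stage one atom position per thread into shared memory.
    // Guard added so threads past the end of devData do not load out of range;
    // guarded threads still reach both barriers below.
    if (iatom + threadIdx.x < Natom) {
        sdata[3*threadIdx.x]     = devData[3*(iatom + threadIdx.x)];
        sdata[3*threadIdx.x + 1] = devData[3*(iatom + threadIdx.x) + 1];
        sdata[3*threadIdx.x + 2] = devData[3*(iatom + threadIdx.x) + 2];
    }
    __syncthreads(); // staged data must be visible to all threads before anyone reads it
    int end_offset = min(blockDim.x, Natom - iatom);
    for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
        float rx = sdata[3*cur_offset];
        float ry = sdata[3*cur_offset + 1];
        float rz = sdata[3*cur_offset + 2];
        float theta = rx*qx + ry*qy + rz*qz;
        theta = (theta - lrint(theta)) * 2 * 3.1415926f;
        float ct, st;
        sincosf(theta, &st, &ct);
        c += ct;
        s += st;
    }
    __syncthreads(); // every warp must finish reading sdata before any warp
                     // overwrites it in the next iteration (this replaces sync_2)
}
Placed at the end of the iteration, the barrier's purpose is easier to see: it separates the reads of sdata in iteration i from the writes to sdata in iteration i+1.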
"cuda-memcheck --tool racecheck" should tell you where the problem is.
Upvotes: 3