DengPan

Reputation: 3

How do I safely load data from global memory into shared memory in CUDA?

My kernel:

__global__ void myKernel(float *devData, float *devVec, float *devStrFac,
                         int Natom, int vecNo) {

    extern __shared__ float sdata[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread owns one scattering vector q.
    float qx = devVec[3*idx];
    float qy = devVec[3*idx + 1];
    float qz = devVec[3*idx + 2];
    __syncthreads(); // sync_1

    float c = 0.0f, s = 0.0f;
    for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
        // Stage one atom position per thread in registers. (Note: these loads
        // index past the end of devData on the last iteration when Natom is
        // not a multiple of blockDim.x.)
        float rtx = devData[3*(iatom + threadIdx.x)];     // tag_0
        float rty = devData[3*(iatom + threadIdx.x) + 1];
        float rtz = devData[3*(iatom + threadIdx.x) + 2];
        __syncthreads(); // sync_2
        sdata[3*threadIdx.x]     = rtx;                   // tag_1
        sdata[3*threadIdx.x + 1] = rty;
        sdata[3*threadIdx.x + 2] = rtz;
        __syncthreads(); // sync_3

        int end_offset = min(blockDim.x, Natom - iatom);

        for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
            float rx = sdata[3*cur_offset];
            float ry = sdata[3*cur_offset + 1];
            float rz = sdata[3*cur_offset + 2];
            // sync_4
            float theta = rx*qx + ry*qy + rz*qz;

            theta = theta - rintf(theta);      // fractional part in [-0.5, 0.5]
            theta = theta * 2.0f * 3.1415926f; // reduce theta to [-pi, pi]

            float ct, st;
            sincosf(theta, &st, &ct);

            c += ct;
            s += st;
        }
    }

    devStrFac[idx] += c*c + s*s;
}

why "__syncthreads()" labeled sync_2 is needed? Without sync_2, sdata[] get wrong numbers and I get wrong results. Line "tag_1" use the results of line "tag_0", so in my mind sync_2 is no need. Where do I wrong? If due to disorderd instruction executing, I should put a __syncthreads() in line "sync_4"?

Upvotes: 0

Views: 535

Answers (1)

Maxim Milakov

Reputation: 221

Consider one warp of the thread block finishing the first iteration and starting the next one while the other warps are still working on the first iteration. If you don't have __syncthreads at label sync_2, this warp will write to shared memory while the others are still reading from it, which is a race condition.
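To see the two hazards in isolation, here is a minimal sketch (a hypothetical kernel, not the code from the question; it assumes the launch supplies blockDim.x * sizeof(float) bytes of dynamic shared memory and one output element per thread):

__global__ void tileSum(const float *in, float *out, int n) {
    extern __shared__ float tile[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < n; base += blockDim.x) {
        // Stage the next element in a register; a global read needs no barrier.
        float v = (base + threadIdx.x < n) ? in[base + threadIdx.x] : 0.0f;

        __syncthreads(); // role of sync_2: no warp may still be READING the
                         // previous tile when we overwrite it below
        tile[threadIdx.x] = v;
        __syncthreads(); // role of sync_3: the tile must be fully WRITTEN
                         // before any warp starts reading it below

        int lim = min((int)blockDim.x, n - base);
        for (int k = 0; k < lim; k++)
            acc += tile[k]; // reads entries written by *other* threads
    }

    out[idx] = acc;
}

Every thread in the block executes the same number of outer iterations, so the barriers inside the loop are legal; remove the first one and a fast warp can overwrite tile[] while a slow warp is still in the inner loop of the previous iteration.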

You might move the __syncthreads() at label sync_2 to the end of the outer loop for the sake of clarity.
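A sketch of that restructuring against the kernel above (same behavior; note it gives up the register staging at tag_0, which lets the global loads begin before the barrier):

for (int iatom = 0; iatom < Natom; iatom += blockDim.x) {
    sdata[3*threadIdx.x]     = devData[3*(iatom + threadIdx.x)];
    sdata[3*threadIdx.x + 1] = devData[3*(iatom + threadIdx.x) + 1];
    sdata[3*threadIdx.x + 2] = devData[3*(iatom + threadIdx.x) + 2];
    __syncthreads(); // sync_3: tile fully written before any thread reads it

    int end_offset = min(blockDim.x, Natom - iatom);
    for (int cur_offset = 0; cur_offset < end_offset; cur_offset++) {
        // ... inner body unchanged ...
    }

    __syncthreads(); // formerly sync_2: all threads are done reading the tile
                     // before the next iteration overwrites it
}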

"cuda-memcheck --tool racecheck" should tell you where the problem is.

Upvotes: 3
