sk29910

Reputation: 2345

Race conditions despite atomicAdd functions (CUDA)?

I have a problem that is parallel on two levels: I have a ton of sets of (x0, x1, y0, y1) coordinate pairs, which are turned into the variables vdx, vdy and vyy. For each of these sets I'm trying to calculate the values of all "monomials" composed of them up to degree n (i.e. all possible combinations of different powers of them, like vdx^3*vdy*vyy^2 or vdx*1*vyy^4). These values are then added up over all the sets.

My strategy (and for now I'd just like to get it to work; it doesn't have to be optimized with multiple kernels or complex reductions unless it really has to be) is to have each thread deal with one set of coordinate pairs and calculate the values of all their corresponding monomials. Each block's shared memory holds all the monomial sums, and when the block is done, the first thread in the block adds the result to the global sum. Since each block's shared memory is accessed by all threads in all places, I'm using atomicAdd; same with the blocks and the global memory.

Unfortunately there still seems to be a race condition somewhere, since I get different results every time I run the kernel.

If it helps, I'm currently using degree = 3 and omitting one of the variables, which means that in the code below, the innermost for loop (over evbl) doesn't do anything and just repeats 4 times. Indeed, the output of the kernel looks like this: 51502,55043.1,55043.1,51502,47868.5,47868.5,48440.5,48440.6,46284.7,46284.7,46284.7,46284.7,46034.3,46034.3,46034.3,46034.3,44972.8,44972.8,44972.8,44972.8,43607.6,43607.6,43607.6,43607.6,43011,43011,43011,43011,42747.8,42747.8,42747.8,42747.8,45937.8,45937.8,46509.9,46509.9,... and it's noticeable that there is a (rough) pattern of 4-tuples. But every time I run it the values are all very different.

Everything is in floats, but I'm on a compute capability 2.1 GPU, so that shouldn't be a problem. cuda-memcheck also reports no errors.

Can somebody with more CUDA experience give me some pointers on how to track down the race condition here?

__global__ void kernel(...) {

  extern __shared__ float s_data[];

  // just use global memory for now
  // get threadID:
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if(idx >= nPairs) return;

  // ... do some calculations to get x/y...

  // calculate vdx, vdy and vyy
  float vdx = (x1 - x0)/(float)xheight;
  float vdy = (y1 - y0)/(float)xheight;
  float vyy =  0.5*(y0 + y1)/(float)xheight;


  const int offs1 = degree + 1;
  const int offs2 = offs1 * offs1;
  const int offs3 = offs2 * offs1;
  float sol = 1.0;

  // now calculate monomial results and store in shared memory

  for(int evdx = 0; evdx <= degree; evdx++) {
    for(int evdy = 0; evdy <= degree; evdy++) {
      for(int evyy = 0; evyy <= degree; evyy++) {
        for(int evbl = 0; evbl <= degree; evbl++) {
          // monomial value = product of the individual powers
          sol = powf(vdx, evdx) * powf(vdy, evdy) * powf(vyy, evyy);
          atomicAdd(&(s_data[evbl + offs1*evyy + offs2*evdy +
                offs3*evdx]), sol/1000.0);

        }
      }
    }
  }

  // now copy shared memory to global
  __syncthreads();
  if(threadIdx.x == 0) {
    for(int i = 0; i < nMonomials; i++) {
      atomicAdd(&outmD[i], s_data[i]);
    }
  }
}

Upvotes: 0

Views: 1107

Answers (1)

brano

Reputation: 2870

You are using shared memory but you are never initializing it.
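A minimal sketch of one way to do that, assuming nMonomials floats of dynamic shared memory are requested at launch; the zeroing and the barrier should come before the early "if(idx >= nPairs) return;" so that every thread in the block reaches the __syncthreads():

__global__ void kernel(...) {

  extern __shared__ float s_data[];

  // zero the shared accumulator cooperatively: each thread clears a
  // strided subset of the nMonomials entries
  for(int i = threadIdx.x; i < nMonomials; i += blockDim.x) {
    s_data[i] = 0.0f;
  }
  __syncthreads(); // all threads must see the zeroed buffer before any atomicAdd

  // ... rest of the kernel unchanged ...
}

Without this, each block accumulates into whatever values happened to be left in shared memory, which is why the sums come out different on every run.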

Upvotes: 4
