spialla

Reputation: 3

CUDA optimization: nested loops

I am trying to port this code to CUDA:

double square = 0;
for (int j = 0; j < width; j++) {
  double Up = 0, Down = 0;
  for (int i = 0; i < height; i++) {
    if (array1[i] > 0 && array2[i] > 0) {
      square = source[i*width + j];
      square = square * square;
      Up   += square * array2[i] / array1[i];
      Down += square;
    }
  }
  if (Down > 0) {
    out[j] *= (1. + (Up/Down - 1.));
  }
}

In my first attempt I parallelized the outer for loop, which works well:

int j = blockDim.x * blockIdx.x + threadIdx.x;

double Up = 0, Down = 0, square = 0;
if (j < width) {
  for (int i = 0; i < height; i++) {
    if (array1[i] > 0 && array2[i] > 0) {
      square = source[i*width + j];
      square = square * square;
      Up   += square * array2[i] / array1[i];
      Down += square;
    }
  }
  if (Down > 0) {
    out[j] *= (1. + (Up/Down - 1.));
  }
}

I would also like to parallelize the inner for loop. I tried it with a 2D grid, but it does not work. This is the kernel:

int j = blockDim.x * blockIdx.x + threadIdx.x;
int i = blockDim.y * blockIdx.y + threadIdx.y;
int offset = j + i * blockDim.x * gridDim.x;

double Up[width], Down[width], square[height];
if (j >= width && i >= height) return;

if (array1[i] > 0 && array2[i] > 0) {
  square[i] = source[offset]*source[offset];
  Up[j]   += square[i]*array2[i]/array1[i];
  Down[j] += square[i];
}
if (Down[j] > 0) {
  out[j] *= (1. + (Up[j]/Down[j] - 1.));
}

and this is the kernel call:

dim3 blocks(32,32);
dim3 grid(width/32,height/32);
kernel <<< grid, blocks >>> (...);
cudaDeviceSynchronize();

... what is the error? Are there more efficient solutions? (Could I use dynamic parallelism?)

Thanks a lot!

Upvotes: 0

Views: 883

Answers (1)

Roger Dahl

Reputation: 15724

In your last kernel, it looks like you intended the Up, Down and square arrays to be shared between threads, but those arrays are thread-local, so the data they contain is not visible to other threads. Unfortunately, your approach wouldn't work even if they were shared between threads.
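As a toy illustration of the storage classes (a hypothetical kernel, not part of your code):

__global__ void storage_demo()
{
  double local_arr[8];              // per-thread: every thread gets its own private copy
  __shared__ double shared_arr[8];  // per-block: shared only by the threads of one block
  // Nothing is automatically shared across the whole grid; for that you
  // have to use global memory passed in as a kernel argument.
}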

In your inner loop, each iteration uses data that was calculated in the previous iteration (the running Up and Down sums). It is not entirely trivial to parallelize such loops, and sometimes it cannot be done at all. In your case, a simple solution would be to use atomic operations to accumulate the Up and Down sums, but it wouldn't be efficient, because atomic operations on the same address are implicitly serialized.
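Roughly like this (a sketch only; the kernel names are mine, Up and Down are assumed to be zero-initialized double arrays of length width in global memory, and atomicAdd on double requires compute capability 6.0 or newer):

__global__ void accumulate(const double *source, const double *array1,
                           const double *array2, double *Up, double *Down,
                           int width, int height)
{
  int j = blockDim.x * blockIdx.x + threadIdx.x;  // column
  int i = blockDim.y * blockIdx.y + threadIdx.y;  // row
  if (j >= width || i >= height) return;          // note: ||, not &&

  if (array1[i] > 0 && array2[i] > 0) {
    double square = source[i*width + j];
    square = square * square;
    atomicAdd(&Up[j],   square * array2[i] / array1[i]);  // serialized on conflicts
    atomicAdd(&Down[j], square);
  }
}

__global__ void apply(const double *Up, const double *Down, double *out, int width)
{
  int j = blockDim.x * blockIdx.x + threadIdx.x;
  if (j < width && Down[j] > 0)
    out[j] *= Up[j] / Down[j];  // same as 1.+(Up/Down-1.)
}

Since height threads contend for the same Up[j] and Down[j], the atomics effectively serialize each column, which is why this is unlikely to beat your working 1D kernel.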

You should probably look into solving this with existing parallel primitives, such as reductions and prefix sums, that have already been optimized. For instance, those in CUB or Thrust.
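For example, the per-column Up sums could be computed with a single reduce_by_key call (an untested sketch; the functor names are mine, source, array1 and array2 are raw device pointers, and Down would be computed the same way with just s*s as the term):

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Walk the matrix in column-major order so that each column's
// elements are contiguous, which is what reduce_by_key needs.
struct column_of
{
  int height;
  __host__ __device__ int operator()(int k) const { return k / height; }
};

struct up_term
{
  const double *source, *array1, *array2;
  int width, height;
  __host__ __device__ double operator()(int k) const
  {
    int i = k % height;  // row
    int j = k / height;  // column
    if (array1[i] > 0 && array2[i] > 0) {
      double s = source[i*width + j];
      return s * s * array2[i] / array1[i];
    }
    return 0.0;
  }
};

// ...

thrust::counting_iterator<int> first(0);
thrust::device_vector<int> cols(width);
thrust::device_vector<double> Up(width);
thrust::reduce_by_key(
    thrust::make_transform_iterator(first, column_of{height}),
    thrust::make_transform_iterator(first + width*height, column_of{height}),
    thrust::make_transform_iterator(first, up_term{source, array1, array2, width, height}),
    cols.begin(), Up.begin());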

Upvotes: 1
