spialla

Reputation: 3

CUDA optimization: nested loops

I am trying to port this code to CUDA:

double square = 0;
for (int j = 0; j < width; j++) {
  double Up = 0, Down = 0;
  for (int i = 0; i < height; i++) {
    if (array1[i] > 0 && array2[i] > 0) {
      square = source[i*width + j];
      square = square * square;
      Up   += square * array2[i] / array1[i];
      Down += square;
    }
  }
  if (Down > 0) {
    out[j] *= (1. + (Up/Down - 1.));
  }
}

In my first attempt I parallelized the outer for loop, which works well:

int j = blockDim.x * blockIdx.x + threadIdx.x;

double Up = 0, Down = 0, square = 0;
if (j < width) {
  for (int i = 0; i < height; i++) {
    if (array1[i] > 0 && array2[i] > 0) {
      square = source[i*width + j];
      square = square * square;
      Up   += square * array2[i] / array1[i];
      Down += square;
    }
  }
  if (Down > 0) {
    out[j] *= (1. + (Up/Down - 1.));
  }
}

I would also like to parallelize the inner for loop. I tried it with a 2D grid, but it does not work. This is the kernel:

int j = blockDim.x * blockIdx.x + threadIdx.x;
int i = blockDim.y * blockIdx.y + threadIdx.y;
int offset = j + i * blockDim.x * gridDim.x;

double Up[width], Down[width], square[height];
if (j >= width && i >= height) return;

if (array1[i] > 0 && array2[i] > 0) {
  square[i] = source[offset]*source[offset];
  Up[j]   += square[i]*array2[i]/array1[i];
  Down[j] += square[i];
}
if (Down[j] > 0) {
  out[j] *= (1. + (Up[j]/Down[j] - 1.));
}

and this is the kernel call:

dim3 blocks(32,32);
dim3 grid(width/32,height/32);
kernel <<< grid, blocks >>> (...);
cudaDeviceSynchronize();

... what is the error? Are there more efficient solutions? (Could I use dynamic parallelism?)

Thanks a lot!

Upvotes: 0

Views: 883

Answers (1)

Roger Dahl

Reputation: 15724

In your last kernel, it looks like you intended the Up, Down and square arrays to be shared between threads, but those arrays are thread-local, so the data they contain is not visible to other threads. Unfortunately, your approach wouldn't work even if they were shared between threads.
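As a toy illustration of the storage classes (a hypothetical kernel, not part of your code):

__global__ void storage_demo()
{
  double local_arr[8];              // per-thread: every thread gets its own private copy
  __shared__ double shared_arr[8];  // per-block: shared only by the threads of one block
  // Nothing is automatically shared across the whole grid; for that you
  // have to use global memory passed in as a kernel argument.
}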

In your inner loop, each iteration uses data that was calculated in the previous iteration (the running Up and Down sums). It is not entirely trivial to parallelize such loops, and sometimes it cannot be done at all. In your case, a simple solution would be to use atomic operations to accumulate the Up and Down sums, but it wouldn't be efficient, because atomic operations on the same address are implicitly serialized.
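Roughly like this (a sketch only; the kernel names are mine, Up and Down are assumed to be zero-initialized double arrays of length width in global memory, and atomicAdd on double requires compute capability 6.0 or newer):

__global__ void accumulate(const double *source, const double *array1,
                           const double *array2, double *Up, double *Down,
                           int width, int height)
{
  int j = blockDim.x * blockIdx.x + threadIdx.x;  // column
  int i = blockDim.y * blockIdx.y + threadIdx.y;  // row
  if (j >= width || i >= height) return;          // note: ||, not &&

  if (array1[i] > 0 && array2[i] > 0) {
    double square = source[i*width + j];
    square = square * square;
    atomicAdd(&Up[j],   square * array2[i] / array1[i]);  // serialized on conflicts
    atomicAdd(&Down[j], square);
  }
}

__global__ void apply(const double *Up, const double *Down, double *out, int width)
{
  int j = blockDim.x * blockIdx.x + threadIdx.x;
  if (j < width && Down[j] > 0)
    out[j] *= Up[j] / Down[j];  // same as 1.+(Up/Down-1.)
}

Since height threads contend for the same Up[j] and Down[j], the atomics effectively serialize each column, which is why this is unlikely to beat your working 1D kernel.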

You should probably look into solving this with existing parallel primitives, such as reductions and prefix sums, that have already been optimized. For instance, those in CUB or Thrust.
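For example, the per-column Up sums could be computed with a single reduce_by_key call (an untested sketch; the functor names are mine, source, array1 and array2 are raw device pointers, and Down would be computed the same way with just s*s as the term):

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Walk the matrix in column-major order so that each column's
// elements are contiguous, which is what reduce_by_key needs.
struct column_of
{
  int height;
  __host__ __device__ int operator()(int k) const { return k / height; }
};

struct up_term
{
  const double *source, *array1, *array2;
  int width, height;
  __host__ __device__ double operator()(int k) const
  {
    int i = k % height;  // row
    int j = k / height;  // column
    if (array1[i] > 0 && array2[i] > 0) {
      double s = source[i*width + j];
      return s * s * array2[i] / array1[i];
    }
    return 0.0;
  }
};

// ...

thrust::counting_iterator<int> first(0);
thrust::device_vector<int> cols(width);
thrust::device_vector<double> Up(width);
thrust::reduce_by_key(
    thrust::make_transform_iterator(first, column_of{height}),
    thrust::make_transform_iterator(first + width*height, column_of{height}),
    thrust::make_transform_iterator(first, up_term{source, array1, array2, width, height}),
    cols.begin(), Up.begin());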

Upvotes: 1
