Maurex

Reputation: 339

How to parallelise a nested loop with cross-element dependencies in CUDA?

I'm a beginner at CUDA and am having some difficulties with it.

If I have an input vector A and a result vector B, both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loop simultaneously.

Edit: I have a device with cc 2.0.

example:

// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
  double ai = a[i]; 
  for(j=0; j<1000; j++) {
    double aj = a[j];
    if (i == j)
      continue;
    result += ai - aj;  
  }
}

I have this at the moment:

//in host
int i;
for(i=0; i<1000; i++) {
    kernelFunc <<<2, 500>>> (i, d_a);
}

Is there a way to eliminate the serial loop?

Upvotes: 0

Views: 970

Answers (1)

Robert Crovella

Reputation: 151799

Something like this should work, I think:

__global__ void my_diffs(const double *a, double *b, const int length){

  unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < length){
    double my_a = a[idx];           // each thread handles one output element
    double result = 0.0;
    for (int j=0; j<length; j++)    // loop over every element of a
      result += my_a - a[j];
    b[idx] = result;
  }
}

(written in browser, not tested)
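
For completeness, a hypothetical host-side driver might look like this (also untested; error checking omitted, and the a initialization is just a placeholder). The single kernel launch replaces the serial host loop from your question:

#include <cuda_runtime.h>

#define N 1000

int main(){
  double a[N], b[N];
  for (int i = 0; i < N; i++) a[i] = (double)i;   // a = some stuff

  double *d_a, *d_b;
  cudaMalloc(&d_a, N*sizeof(double));
  cudaMalloc(&d_b, N*sizeof(double));
  cudaMemcpy(d_a, a, N*sizeof(double), cudaMemcpyHostToDevice);

  my_diffs<<<2, 500>>>(d_a, d_b, N);   // one thread per element of b, no host loop

  cudaMemcpy(b, d_b, N*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}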

This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:

  1. use shared memory - we can reduce the number of global loads to one per element per block (see the sketch just after this list). However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements?) so the benefits might be limited
  2. create an offset indexing scheme, so each thread uses a different element from the cache line, to create coalesced access (i.e. modify the j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to the L1 cache as well as the ability to broadcast global reads within a warp.
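
For example, a shared-memory version (item 1) might look something like this -- again written in browser and not tested; BLOCK_SIZE and my_diffs_smem are just names I've picked for illustration:

#define BLOCK_SIZE 128

__global__ void my_diffs_smem(const double *a, double *b, const int length){

  __shared__ double tile[BLOCK_SIZE];
  unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
  double my_a = (idx < length) ? a[idx] : 0.0;
  double result = 0.0;
  // walk the vector one tile at a time; each element is read from global
  // memory once per block instead of once per thread
  for (int base = 0; base < length; base += BLOCK_SIZE){
    int j = base + threadIdx.x;
    tile[threadIdx.x] = (j < length) ? a[j] : 0.0;
    __syncthreads();
    int tile_len = min(BLOCK_SIZE, length - base);
    for (int k = 0; k < tile_len; k++)
      result += my_a - tile[k];
    __syncthreads();
  }
  if (idx < length)
    b[idx] = result;
}

// launched with a block size matching the tile, e.g.:
// my_diffs_smem<<<8, BLOCK_SIZE>>>(d_a, d_b, len);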

If you must use a cc 1.x device, then you'll get significant mileage out of one or more of these optimizations -- the code I've shown here will run noticeably slower in that case.

Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that term is exactly zero anyway and should not disturb your results. If you're concerned about it, you can special-case it out easily enough.
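
If you do want to skip it, a minimal tweak to the loop in the kernel (my addition, not strictly necessary) would be:

for (int j=0; j<length; j++)
  if (j != idx)                    // skip the a[idx] - a[idx] term
    result += my_a - a[j];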

You'll also get more performance if you increase the number of blocks and reduce the threads per block, perhaps something like this:

my_diffs<<<8,128>>>(d_a, d_b, len);

The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize performance on these GPUs with such a small data set, we want to try to get at least one block launched on each SM. Having more blocks in the grid makes this more likely.

If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, square, with dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel to sum each row of c (each thread has a loop to sum one row) to create the result vector b. However, I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed. A rough, untested sketch of it might look like this (kernel names, block shape and the c buffer are illustrative assumptions):
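
__global__ void pairwise_diffs(const double *a, double *c, const int len){
  unsigned col = threadIdx.x + blockDim.x*blockIdx.x;
  unsigned row = threadIdx.y + blockDim.y*blockIdx.y;
  if (row < len && col < len)
    c[row*len+col] = a[row] - a[col];     // one thread per (row, col) pair
}

__global__ void sum_rows(const double *c, double *b, const int len){
  unsigned row = threadIdx.x + blockDim.x*blockIdx.x;
  if (row < len){
    double result = 0.0;
    for (int col = 0; col < len; col++)   // each thread sums one row of c
      result += c[row*len+col];
    b[row] = result;
  }
}

// host side, assuming len = 1000 and d_c holds len*len doubles:
// dim3 block(16,16);
// dim3 grid((len+block.x-1)/block.x, (len+block.y-1)/block.y);
// pairwise_diffs<<<grid, block>>>(d_a, d_c, len);
// sum_rows<<<(len+127)/128, 128>>>(d_c, d_b, len);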

Upvotes: 2
