Reputation: 31
What is the best way to do this in CUDA?
...
for (int i = 0; i < size; ++i)
    for (int j = i + 1; j < size; ++j)
        temp_norm += exp((train[i] - train[j]) / tau);
Would this be equivalent?
...
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;
if (i >= size || j >= size) return;
if (j > i)
    temp_norm += exp((train[i] - train[j]) / tau);
Any help would be much appreciated!
Upvotes: 1
Views: 1064
Reputation: 27809
How best to implement this really depends on how big size is. But assuming it is quite large, e.g. 1000 or more...
To do it the way you suggest, you would need to use atomicAdd(), which can be expensive if too many threads atomically add to the same address. A better way is probably to use a parallel reduction.
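(For reference, the atomicAdd version of your 2D kernel would look roughly like the untested sketch below; the kernel name and parameter list are made up here, it assumes float data, and atomicAdd on float requires compute capability 2.0 or later. Every contributing thread serializes on the single accumulator, which is why it can be slow.)
__global__ void norm_atomic(const float *train, float *temp_norm, int size, float tau)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= size || j >= size) return;
    if (j > i)
        atomicAdd(temp_norm, expf((train[i] - train[j]) / tau));  // one global atomic per (i, j) pair
}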
Check out the "reduction" sample in the NVIDIA CUDA SDK.
YMMV with the following since it is untested, and I don't know your data size, but something like this should work. Use the "reduction6" kernel from that example, but add your computation to the first while loop. Replace the initialization of i and gridSize with
unsigned int i = blockIdx.x * blockSize + threadIdx.x;
unsigned int gridSize = blockSize * gridDim.x;
Replace the while (i < n) loop with
while (i < size)
{
    for (unsigned int j = i + 1; j < size; ++j)
        mySum += exp((train[i] - train[j]) / tau);
    i += gridSize;
}
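If you don't have the SDK sample handy, a simplified self-contained version of the same idea is sketched below (untested; it uses a plain shared-memory tree reduction rather than the fully unrolled reduction6 kernel, assumes float data, and writes one partial sum per block into a hypothetical partial[] array that you then add up on the host or in a second pass):
__global__ void norm_reduce(const float *train, float *partial, unsigned int size, float tau)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    // Each thread accumulates its own sum over a grid-stride loop.
    float mySum = 0.0f;
    while (i < size)
    {
        for (unsigned int j = i + 1; j < size; ++j)
            mySum += expf((train[i] - train[j]) / tau);
        i += gridSize;
    }

    // Standard shared-memory tree reduction within the block.
    sdata[tid] = mySum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial result per block.
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
Launch it with something like norm_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_train, d_partial, size, tau) (threads a power of two), then sum the blocks entries of d_partial.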
(Note, floating point arithmetic is non-associative, so the different order of operations in a parallel implementation may give you a slightly different answer than the sequential implementation. It may even give you a slightly more accurate answer due to the balanced tree reduction, depending on your input data.)
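A tiny host-side illustration of that non-associativity (the values are just chosen so the effect shows up in single precision):
#include <stdio.h>

int main(void)
{
    // Same three terms, different grouping: float cannot represent 1e8 + 1 exactly.
    float a = 1.0e8f, b = 1.0f;
    printf("%f\n", (a + b) - a);   // prints 0.000000: b is lost when added to the large value first
    printf("%f\n", (a - a) + b);   // prints 1.000000
    return 0;
}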
Upvotes: 2