Reputation: 31
What is the best way to do this in CUDA?
...
for (int i = 0; i < size; ++i)
    for (int j = i + 1; j < size; ++j)
        temp_norm += exp((train[i] - train[j]) / tau);
Would this be equivalent?
...
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;
if (i >= size || j >= size) return;
if (j > i)
    temp_norm += exp((train[i] - train[j]) / tau);
Any help would be much appreciated!
Upvotes: 1
Views: 1064
Reputation: 27809
How best to implement this really depends on how big size is. But assuming it is quite large, e.g. 1000 or more...
To do it the way you suggest, you would need to use atomicAdd(), which can be expensive if too many threads atomically add to the same address. A better way is probably to use a parallel reduction.
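(For reference, the atomicAdd version of your 2D kernel would look roughly like the untested sketch below; the kernel name and parameter list are made up here, it assumes float data, and atomicAdd on float requires compute capability 2.0 or later. Every contributing thread serializes on the single accumulator, which is why it can be slow.)
__global__ void norm_atomic(const float *train, float *temp_norm, int size, float tau)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= size || j >= size) return;
    if (j > i)
        atomicAdd(temp_norm, expf((train[i] - train[j]) / tau));  // one global atomic per (i, j) pair
}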
Check out the "reduction" sample in the NVIDIA CUDA SDK.
YMMV with the following since it is untested, and I don't know your data size, but something like this should work. Use the "reduction6" kernel from that example, but add your computation to the first while loop. Replace the initialization of i and gridSize with
unsigned int i = blockIdx.x * blockSize + threadIdx.x;
unsigned int gridSize = blockSize * gridDim.x;
Replace the while (i < n) loop with
while (i < size)
{
    for (unsigned int j = i + 1; j < size; ++j)
        mySum += exp((train[i] - train[j]) / tau);
    i += gridSize;
}
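If you don't have the SDK sample handy, a simplified self-contained version of the same idea is sketched below (untested; it uses a plain shared-memory tree reduction rather than the fully unrolled reduction6 kernel, assumes float data, and writes one partial sum per block into a hypothetical partial[] array that you then add up on the host or in a second pass):
__global__ void norm_reduce(const float *train, float *partial, unsigned int size, float tau)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    // Each thread accumulates its own sum over a grid-stride loop.
    float mySum = 0.0f;
    while (i < size)
    {
        for (unsigned int j = i + 1; j < size; ++j)
            mySum += expf((train[i] - train[j]) / tau);
        i += gridSize;
    }

    // Standard shared-memory tree reduction within the block.
    sdata[tid] = mySum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial result per block.
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
Launch it with something like norm_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_train, d_partial, size, tau) (threads a power of two), then sum the blocks entries of d_partial.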
(Note, floating point arithmetic is non-associative, so the different order of operations in a parallel implementation may give you a slightly different answer than the sequential implementation. It may even give you a slightly more accurate answer due to the balanced tree reduction, depending on your input data.)
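A tiny host-side illustration of that non-associativity (the values are just chosen so the effect shows up in single precision):
#include <stdio.h>

int main(void)
{
    // Same three terms, different grouping: float cannot represent 1e8 + 1 exactly.
    float a = 1.0e8f, b = 1.0f;
    printf("%f\n", (a + b) - a);   // prints 0.000000: b is lost when added to the large value first
    printf("%f\n", (a - a) + b);   // prints 1.000000
    return 0;
}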
Upvotes: 2