Reputation: 169
I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain it a bit clearer with the variables.
__global__ void kernelSumWeights(float* sumArray, float* weightArray,
                                 int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH * threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x * blockIdx.y) * TILE_WIDTH * TILE_WIDTH + nx;
    if (index_in < cLength)
    {
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}
First off, cLength is the number of connections in the network. For every connection there is a source neuron, a target neuron, and a weight for that connection. sourceTargetArray contains that information: the .x field of sourceTargetArray[i] is the source neuron index of connection i, and the .y field is the target neuron index of connection i. weightArray contains the weight information, so weightArray[i] is the weight of connection i.
As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights of all its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by a factor of about 12-15x (so when the right answer is 700.0, what I'm getting is something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure whether this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.
Upvotes: 1
Views: 2768
Reputation: 13723
You need to do a reduction.
Sum the elements assigned to each thread and place the result in an array cache[threadsPerBlock], then call __syncthreads().
Now reduce the resulting subtotals by adding successive neighboring subtotals:
int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).
There is one big caveat with this approach. The additions will not occur in the same order they would in serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.
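Putting the fragment above into context, a minimal per-block reduction kernel might look like the following (a sketch assuming blockDim.x is a power of two; a second kernel or a host-side loop would still have to combine the per-block partial sums):

```cuda
__global__ void blockReduce(const float* in, float* blockSums, int n)
{
    extern __shared__ float cache[];          // one float per thread
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Phase 1: each thread loads one element (zero-padding past the end).
    cache[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Phase 2: tree reduction over shared memory.
    for (int i = blockDim.x / 2; i != 0; i /= 2)
    {
        if (tid < i)
            cache[tid] += cache[tid + i];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];
}
```

Note the __syncthreads() after every halving step: it must be executed by all threads in the block, which is why it sits outside the if.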
Upvotes: 3
Reputation: 60065
+= is not atomic, hence not thread safe. Use atomicAdd().
Also, you should avoid having many threads write to the same memory cell. The problem is that these accesses will be serialized: the threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. A parallel merge can be implemented very efficiently.
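Applied to the kernel from the question, the change is one line (a sketch; note that atomicAdd() on float requires a device of compute capability 2.0 or higher):

```cuda
__global__ void kernelSumWeights(float* sumArray, float* weightArray,
                                 int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH * threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x * blockIdx.y) * TILE_WIDTH * TILE_WIDTH + nx;
    if (index_in < cLength)
    {
        // atomicAdd performs the read-modify-write as one indivisible
        // operation, so concurrent updates to the same target neuron
        // no longer lose each other's contributions.
        atomicAdd(&sumArray[sourceTargetArray[index_in].y],
                  fabsf(weightArray[index_in]));
    }
}
```

This is correct but can be slow when many connections share a target neuron, since the atomics serialize; the two-phase compute-then-merge approach above avoids that contention.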
Upvotes: 4