Paul

Reputation: 169

Coding a CUDA Kernel that has many threads writing to the same index?

I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.

So here is the kernel code, and I'll try to explain the variables a bit more clearly.

__global__ void kernelSumWeights(float* sumArray, float* weightArray,
                                 int2* sourceTargetArray, int cLength)
{
    // Flatten the 2D thread/block coordinates into a single connection index.
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;

    if (index_in < cLength)
    {
        // Add |weight| of connection index_in into its target neuron's slot.
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}

First off, the number of connections in the network is cLength. For every connection, there is a source neuron and a target neuron, as well as a weight for that connection. sourceTargetArray contains that information: for connection i, sourceTargetArray[i].x is the source neuron index and sourceTargetArray[i].y is the target neuron index. weightArray contains the weight information (so index i of weightArray corresponds to connection i).

As you can see, sumArray is where I'm storing the sums. So the kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights of all its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
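For reference, here is a minimal serial sketch of what I intend the kernel to compute (numTargets, the total neuron count, is a name I'm introducing just for illustration; the rest match the kernel's parameters):

// Host-side reference: per target neuron, sum |weight| over all incoming
// connections. Lives in the same .cu file, so int2 and fabsf come from nvcc.
void serialSumWeights(float* sumArray, const float* weightArray,
                      const int2* sourceTargetArray, int cLength, int numTargets)
{
    for (int t = 0; t < numTargets; ++t)
        sumArray[t] = 0.0f;                        // clear accumulators
    for (int i = 0; i < cLength; ++i)              // one pass over connections
        sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);
}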

The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by a factor of about 12-15x (so the right answer will be 700.0 and what I'm getting is something in the 50s range).

You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?

EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, "Memory Fence Functions".

Upvotes: 1

Views: 2768

Answers (2)

Ade Miller

Reputation: 13723

You need to do a reduction.

Sum the elements assigned to each thread and place the result in a shared-memory array, cache[threadsPerBlock], then call __syncthreads().

Now reduce the resulting subtotals by adding successive neighboring subtotals:

int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i]; // fold upper half onto lower half
    __syncthreads();  // every thread must finish this step before the stride halves
    i /= 2;
}
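Put together, a minimal sketch of the whole pattern might look like this (kernelBlockSum, blockSums, and THREADS_PER_BLOCK are illustrative names, not from the question; the kernel produces one partial sum per block, which you then add up on the host or in a second kernel):

#define THREADS_PER_BLOCK 256  // assumed block size; must be a power of two

__global__ void kernelBlockSum(const float* weightArray, float* blockSums, int cLength)
{
    __shared__ float cache[THREADS_PER_BLOCK];
    int cacheIndex = threadIdx.x;
    int index_in = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: each thread sums its own elements (grid-stride loop).
    float temp = 0.0f;
    while (index_in < cLength)
    {
        temp += fabsf(weightArray[index_in]);
        index_in += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();

    // Phase 2: tree reduction of the per-thread subtotals.
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // Thread 0 writes this block's partial sum.
    if (cacheIndex == 0)
        blockSums[blockIdx.x] = cache[0];
}

Note that this reduces a single array to one total. To get per-target-neuron sums as in the question, you would first group or sort the connections by target neuron so that each neuron's segment can be reduced independently.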

The following deck explains this in some detail:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Sample code for this is here:

http://www.nvidia.com/object/cuda_sample_data-parallel.html

It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).

There is one big caveat with this approach. The additions will not occur in the same order they would with serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.

Upvotes: 3

Andrey

Reputation: 60065

+= is not atomic => not thread safe. Use atomicAdd().
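For example, the accumulation line in the asker's kernel could become the following (a sketch; note that atomicAdd on float requires a device of compute capability 2.0 or higher):

// Atomic read-modify-write: safe when many threads hit the same target index.
atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabs(weightArray[index_in]));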

Also, you should avoid writing to the same memory cell. The problem is that these calls will be serialized: the threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.

Upvotes: 4
