Mmm Donuts

Reputation: 10283

CUDA Dot Product Failing for Non-Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.

Let's say I set the array size to 2500 and the max thread count per block to 1024.

In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as such:

nblocks = (n + 1024 - 1) / 1024

So, nblocks = 3
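
Written out in code, that check looks like this (ntpb here names the same 1024-thread compile-time constant the kernel below sizes its shared memory with):

const int ntpb = 1024;                // max threads per block
int n = 2500;
int nblocks = (n + ntpb - 1) / ntpb;  // integer ceiling division: (2500 + 1023) / 1024 = 3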

This is my kernel function:

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]

    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    //calc product
    if (i < n)
        s[tIdx] = a[i] * b[i];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }

}

I call the kernel:

dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
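
(The second summing kernel mentioned at the top isn't shown in the question. For completeness, a minimal sketch of what it might look like, assuming nblocks <= ntpb and a single-block launch; the name sumPartials is hypothetical, not the asker's code:)

// Hypothetical second-stage kernel: sums the nblocks partial dot
// products in c into c[0]. Launch as sumPartials<<<1, ntpb>>>(d_c, nblocks).
__global__ void sumPartials(float* c, int nblocks) {
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    // pad unused entries with zero so the tree reduction stays valid
    s[tIdx] = (tIdx < nblocks) ? c[tIdx] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tIdx < stride)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (tIdx == 0)
        c[0] = s[0];
}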

And everything works! Well, almost.

d_c, which has 3 elements (each one the dot product of a block), is thrown off on the last element:

d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number on the order of 10^18

Can someone point out why this is occurring? It only seems to work for multiples of 1024 (2048, 3072, and so on). Am I reading uninitialized values, or is something overflowing?

Thank you!

Edit:

// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);

// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));

// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

Initialization of the arrays is done in this function (n times):

void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // uniform in [0.0f, 1.0f]
}

Upvotes: 3

Views: 179

Answers (1)

talonmies

Reputation: 72350

The basic problem here is that the sum reduction can only work correctly when you have a power-of-two number of threads per block, with every entry in shared memory initialised. That isn't a limitation in practice if you do something like this:

__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]

    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    //calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}

and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512, or 1024). The while loop accumulates multiple values per thread and stores that partial dot product in shared memory, with every entry containing either 0 or a valid partial sum; the reduction then happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one fewer than you think you require if the problem size is small). Performance will be improved as well at larger problem sizes.
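
For illustration, here is a minimal host-side sketch of that launch strategy. The heuristic of two resident blocks per SM and the host-side sum of the partials are assumptions for the sketch, not part of the answer's code:

// Size the grid to "fill" the GPU rather than to the data size.
// Assumption: 2 resident blocks per SM is a reasonable starting point.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int nblocks = 2 * prop.multiProcessorCount;

// Note: d_c must be allocated with nblocks elements for this launch.
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);

// Finish the reduction over the nblocks partial sums. A second
// single-block kernel also works; a host-side sum is shown for brevity.
float* h_c = new float[nblocks];
cudaMemcpy(h_c, d_c, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
float dot = 0.0f;
for (int i = 0; i < nblocks; i++)
    dot += h_c[i];
delete[] h_c;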

If you haven't already seen it, there is a very instructive whitepaper by Mark Harris from NVIDIA on step-by-step optimisation of the basic parallel reduction. I highly recommend reading it.

Upvotes: 2
