Mmm Donuts

Reputation: 10283

CUDA Dot Product Failing for Non-Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.

Let's say I set the array size to 2500 and the max thread count per block to 1024.

In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as such:

nblocks = (n + 1024 - 1) / 1024

So, nblocks = 3
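
Written out in code, that check looks like this (ntpb here names the same 1024-thread compile-time constant the kernel below sizes its shared memory with):

const int ntpb = 1024;                // max threads per block
int n = 2500;
int nblocks = (n + ntpb - 1) / ntpb;  // integer ceiling division: (2500 + 1023) / 1024 = 3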

This is my kernel function:

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]

    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    //calc product
    if (i < n)
        s[tIdx] = a[i] * b[i];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }

}

I call the kernel:

dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
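
(The second summing kernel mentioned at the top isn't shown in the question. For completeness, a minimal sketch of what it might look like, assuming nblocks <= ntpb and a single-block launch; the name sumPartials is hypothetical, not the asker's code:)

// Hypothetical second-stage kernel: sums the nblocks partial dot
// products in c into c[0]. Launch as sumPartials<<<1, ntpb>>>(d_c, nblocks).
__global__ void sumPartials(float* c, int nblocks) {
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    // pad unused entries with zero so the tree reduction stays valid
    s[tIdx] = (tIdx < nblocks) ? c[tIdx] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tIdx < stride)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (tIdx == 0)
        c[0] = s[0];
}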

And everything works! Well, almost.

d_c, which has 3 elements (each one the dot product of a block), is thrown off on the last element:

d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number on the order of 10^18

Can someone point out why this is occurring? It only seems to work for multiples of 1024 (2048, 3072, and so on). Am I reading uninitialized values, or is something overflowing?

Thank you!

Edit:

// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);

// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));

// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

Initialization of the arrays is done in this function (n times):

void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // uniform in [0.0f, 1.0f]
}

Upvotes: 3

Views: 179

Answers (1)

talonmies

Reputation: 72350

The basic problem here is that the sum reduction can only work correctly when you have a power-of-two number of threads per block, with every entry in shared memory initialised. That isn't a limitation in practice if you do something like this:

__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]

    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    //calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}

and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512, or 1024). The while loop accumulates multiple values per thread and stores that partial dot product in shared memory, with every entry containing either 0 or a valid partial sum; the reduction then happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one fewer than you think you require if the problem size is small). Performance will be improved as well at larger problem sizes.
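
For illustration, here is a minimal host-side sketch of that launch strategy. The heuristic of two resident blocks per SM and the host-side sum of the partials are assumptions for the sketch, not part of the answer's code:

// Size the grid to "fill" the GPU rather than to the data size.
// Assumption: 2 resident blocks per SM is a reasonable starting point.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int nblocks = 2 * prop.multiProcessorCount;

// Note: d_c must be allocated with nblocks elements for this launch.
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);

// Finish the reduction over the nblocks partial sums. A second
// single-block kernel also works; a host-side sum is shown for brevity.
float* h_c = new float[nblocks];
cudaMemcpy(h_c, d_c, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
float dot = 0.0f;
for (int i = 0; i < nblocks; i++)
    dot += h_c[i];
delete[] h_c;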

If you haven't already seen it, there is a very instructive whitepaper by Mark Harris from NVIDIA on step-by-step optimisation of the basic parallel reduction. I highly recommend reading it.

Upvotes: 2
