starrr

Reputation: 1023

Warp shuffle reduction of arrays of any length

I am working on a CUDA kernel which performs a vector dot product (A x B). I assumed that the length of each vector is a multiple of 32 (32, 64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A by the corresponding element of B (thread i ==> psum = A[i] x B[i]). After the multiplication, I used the following functions, which use the warp shuffling technique to perform the reduction and calculate the sum of all the multiplications.

__inline__ __device__
float warpReduceSum(float val) {
    int warpSize = 32;
    for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

__inline__ __device__
float blockReduceSum(float val) {
    static __shared__ float shared[32]; // Shared mem for 32 partial sums (one per warp); must be float to match val
    int lane = threadIdx.x % warpSize;
    int wid = threadIdx.x / warpSize;
    val = warpReduceSum(val);         // Each warp performs partial reduction
    if (lane==0) 
        shared[wid]=val;              // Write reduced value to shared memory
    __syncthreads();                  // Wait for all partial reductions
    //read from shared memory only if that warp existed
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid==0) 
        val = warpReduceSum(val);     // Final reduce within first warp
    return val;
}
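
(For reference, on CUDA 9 and newer __shfl_down is deprecated in favour of __shfl_down_sync, which takes an explicit mask of participating lanes. A minimal sketch of the equivalent warp reduction, assuming all 32 lanes of the warp are active:)

__inline__ __device__
float warpReduceSumSync(float val) {
    // Same shift-down reduction as above, using the CUDA 9+ intrinsic;
    // the mask 0xffffffff means every lane of the warp participates.
    for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}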

I simply call blockReduceSum(psum), where psum is the product of the two elements handled by a thread.
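
(For context, a minimal sketch of the kernel described above, under the question's assumption that the block size equals N and N is a multiple of 32; the kernel name and the out pointer are illustrative, not from the question:)

__global__ void dotKernel(const float *A, const float *B, float *out) {
    // One thread per element; blockDim.x == N, a multiple of 32.
    float psum = A[threadIdx.x] * B[threadIdx.x];
    float sum = blockReduceSum(psum);  // the complete sum ends up in thread 0
    if (threadIdx.x == 0)
        *out = sum;
}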

This approach doesn't work when the length of the array is not a multiple of 32. My question is: can we change this code so that it also works for any length? Or is it impossible, because if the length of the array is not a multiple of 32, some warps would have elements belonging to more than one array?

Upvotes: 0

Views: 1027

Answers (1)

ptrendx

Reputation: 326

First of all, depending on the GPU you are using, performing a dot product with just 1 block will probably not be very efficient (unless you are batching several dot products in 1 kernel, each done by a single block).
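
(A hedged sketch of that batching idea, not part of the original answer: one block per dot product, vectors of length N stored back to back, N a multiple of 32 as in the question; the kernel name and the out array are assumptions.)

__global__ void batchedDotKernel(const float *A, const float *B, float *out, int N) {
    // Block blockIdx.x reduces the blockIdx.x-th pair of vectors.
    float psum = A[blockIdx.x * N + threadIdx.x] * B[blockIdx.x * N + threadIdx.x];
    float sum = blockReduceSum(psum);
    if (threadIdx.x == 0)
        out[blockIdx.x] = sum;  // thread 0 of each block holds that block's total
}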

To answer your question: you can reuse the code you have written by calling your kernel with the number of threads equal to the closest multiple of 32 higher than N (the length of the array), and introducing an if statement before the call to blockReduceSum, which would look like this:

__global__ void kernel(float * A, float * B, int N) {
    float psum = 0;
    if(threadIdx.x < N) // threadIdx.x because you are using a single block; you will need to change it to a more general id once you move to multiple blocks
        psum = A[threadIdx.x] * B[threadIdx.x];
    float sum = blockReduceSum(psum); // the complete sum ends up in thread 0
    //The rest of the computation
}

That way, threads that do not have an array element associated with them, but that need to be there due to the use of __shfl, will contribute 0 to the sum.
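
(As a usage note, not part of the original answer: the host-side launch implied here would round N up to the next multiple of 32, e.g.)

int threads = ((N + 31) / 32) * 32;   // closest multiple of 32 that is >= N
kernel<<<1, threads>>>(d_A, d_B, N);  // single block; d_A, d_B are device pointers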

Upvotes: 2
