Reputation: 89
I'm studying parallel reduction from the well-known slides by Mark Harris. In particular, I implemented optimization step #5, but I get a wrong result: 17 instead of 41. I used the same number sequence shown in the slides. I omitted the "extern" on the shared array in my code because the kernel and the host code are in the same .cu file.
#include <stdio.h>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 16

__global__ void reduce5(int *g_idata, int *g_odata) {
    __shared__ int sdata[THREAD_PER_BLOCK];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // unroll the last warp
    if (tid < 32)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int inputLength = 16;
    int hostInput[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};
    int hostOutput = 0;
    int *deviceInput;
    int *deviceOutput;

    cudaMalloc((void **)&deviceInput, inputLength * sizeof(int));
    cudaMalloc((void **)&deviceOutput, sizeof(int));
    cudaMemcpy(deviceInput, hostInput, inputLength * sizeof(int), cudaMemcpyHostToDevice);

    reduce5<<<1,16>>>(deviceInput, deviceOutput);
    cudaDeviceSynchronize();

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", hostOutput);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return 0;
}
Upvotes: 0
Views: 578
Reputation: 21
I encountered the same problem, and it turns out the threads in the unrolled warp are not actually reading each other's updates unless the shared array is declared as volatile. Simply add volatile when declaring sdata and the problem will be solved.

Please refer to my post: cuda Threads in A Warp appear to be not in synchronization
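As a minimal sketch of that change (keeping everything else in the question's kernel as-is):

```cuda
// volatile forces every read/write of sdata to go through shared memory,
// so the unrolled warp steps see each other's freshly written partial sums
// instead of values cached in registers
__shared__ volatile int sdata[THREAD_PER_BLOCK];
```

Note that on architectures with independent thread scheduling (Volta and later), the implicit warp-synchronous assumption behind this unrolling no longer holds by itself, and __syncwarp() calls between the unrolled steps are needed as well.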
Upvotes: 1
Reputation: 72349
In this code THREAD_PER_BLOCK must be a multiple of 32, and at least 64. The length of the input data must also be twice the product of the block and grid sizes. You don't see it (because you aren't performing any sort of error checking), but the thread and warp reductions will be failing because of out-of-bounds shared memory and global memory accesses.

Also note that extern __shared__ has nothing to do with whether the kernel and other code are in the same file. It denotes that the shared memory for that variable will be allocated dynamically at runtime, rather than statically at compile time. The size of the allocation is passed as the third argument in the kernel launch syntax.
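To make those constraints concrete, here is a sketch of a consistent configuration. The sizes are mine, chosen to satisfy the rules above (64 threads, input length = 2 * 64 * 1 = 128), with error checking reduced to a bare minimum:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 64   // multiple of 32, and at least 64
#define NUM_BLOCKS 1
// each block consumes 2*blockDim.x inputs in its first load
#define INPUT_LENGTH (2 * THREAD_PER_BLOCK * NUM_BLOCKS)

__global__ void reduce5(int *g_idata, int *g_odata) {
    // dynamically allocated: the size comes from the launch configuration,
    // not from a compile-time constant (volatile per the other answer)
    extern __shared__ volatile int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

    for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) {   // now tid+32 stays inside the 64-entry array
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int hostInput[INPUT_LENGTH], hostOutput = 0, expected = 0;
    for (int k = 0; k < INPUT_LENGTH; ++k) {
        hostInput[k] = k % 7 - 3;       // arbitrary test data
        expected += hostInput[k];
    }

    int *deviceInput, *deviceOutput;
    cudaMalloc((void **)&deviceInput, INPUT_LENGTH * sizeof(int));
    cudaMalloc((void **)&deviceOutput, NUM_BLOCKS * sizeof(int));
    cudaMemcpy(deviceInput, hostInput, INPUT_LENGTH * sizeof(int), cudaMemcpyHostToDevice);

    // third launch argument = bytes of dynamic shared memory per block
    reduce5<<<NUM_BLOCKS, THREAD_PER_BLOCK, THREAD_PER_BLOCK * sizeof(int)>>>
        (deviceInput, deviceOutput);

    // minimal error checking: catch a failed launch instead of reading garbage
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    printf("got %d, expected %d\n", hostOutput, expected);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return 0;
}
```

With more than one block, g_odata holds one partial sum per block, and you would finish the reduction on the host or with a second kernel launch.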
Upvotes: 2