Reputation: 2527
Part of a TensorFlow r1.5 Operation that I have written in C++ and CUDA involves a reduction over a Tensor. I have implemented the simple interleaved reduction algorithm, as described here. However, it appears that only part of the buffer is being reduced.
The implementation of the block reduction is as follows
template<typename T>
__global__
void blockReduceDevice(const T *buffer, T *out, size_t len) {
    const size_t tIdx = threadIdx.x;
    const size_t bIdx = blockIdx.x;
    const size_t bDim = blockDim.x;
    const size_t idx = bIdx * bDim + tIdx;
    //To allow templated, dynamic shared memory, we set the
    //smem to be uchar and reinterpret as templated type.
    extern __shared__ __align__(sizeof(T)) unsigned char buffReduce[];
    __syncthreads();
    //Set contribution of this thread. 0 if out of bounds.
    T *reduce = reinterpret_cast<T*>(buffReduce);
    reduce[tIdx] = (idx >= len) ? 0.0 : buffer[idx];
    __syncthreads();
    //Block reduce.
    #pragma unroll
    for (int i = bDim >> 1; i >= 1; i >>= 1) {
        if (tIdx < i) {
            reduce[tIdx] += reduce[tIdx + i];
        }
        __syncthreads();
    }
    if (tIdx == 0) {
        out[bIdx] = reduce[tIdx];
    }
}
The above kernel is invoked as follows
template<typename T>
void testReduce(T *buffer, T *blockVals, const GPUDevice &dev, size_t len) {
    //Get CUDA stream.
    const cudaStream_t &stream = dev.stream();
    //Get launch configuration for reduction operation.
    const auto reduceConfig = tensorflow::GetCudaLaunchConfig(len, dev);
    const size_t blocks = reduceConfig.block_count;
    const size_t threads = reduceConfig.thread_per_block;
    const size_t shared = threads * sizeof(T);
    //Reset buffer to known value.
    std::vector<T> knownValsHost(len, 1.0);
    cudaMemcpyAsync(buffer, &knownValsHost[0], len * sizeof(T), cudaMemcpyHostToDevice, stream);
    CUSAFE(cudaStreamSynchronize(stream));
    //Reset output to nought.
    std::vector<T> tmp(blocks, 0.0);
    cudaMemcpyAsync(blockVals, &tmp[0], blocks * sizeof(T), cudaMemcpyHostToDevice, stream);
    CUSAFE(cudaStreamSynchronize(stream));
    //Reduce on the GPU.
    blockReduceDevice<T><<<blocks, threads, shared, stream>>>(buffer, blockVals, len);
    CUSAFE(cudaPeekAtLastError());
    CUSAFE(cudaStreamSynchronize(stream));
    //Further reduce on the CPU.
    std::vector<T> blockValsHost(blocks, 0.0);
    cudaMemcpyAsync(&blockValsHost[0], blockVals, blocks * sizeof(T), cudaMemcpyDeviceToHost, stream);
    CUSAFE(cudaStreamSynchronize(stream));
    const T resGPU = std::accumulate(blockValsHost.begin(), blockValsHost.end(), static_cast<T>(0));
    //Get result when copying buffer to CPU memory and reducing.
    std::vector<T> bufferHost(len, 0.0);
    cudaMemcpyAsync(&bufferHost[0], buffer, len * sizeof(T), cudaMemcpyDeviceToHost, stream);
    CUSAFE(cudaStreamSynchronize(stream));
    const T resCPU = std::accumulate(bufferHost.begin(), bufferHost.end(), static_cast<T>(0));
    //Print some output for diagnostics.
    std::cout << "Length: " << len << std::endl;
    std::cout << "Num CUDA Blocks: " << blocks << std::endl;
    std::cout << "Num CUDA Threads Per Block: " << threads << std::endl;
    std::cout << "GPU Result: " << resGPU << std::endl;
    std::cout << "CPU Result: " << resCPU << std::endl;
}
In the above test case, the following output is given, where all buffer entries are set to 1.0
Length: 32768
Num CUDA Blocks: 10
Num CUDA Threads Per Block: 1024
GPU Result: 10240
CPU Result: 32768
As can be seen, the CPU reduction using std::accumulate works as expected (as len == resCPU). This leads me to believe that the CUDA kernel is not being fully executed, as blocks * threads = 10 * 1024 = 10240 != len, and 10240 is exactly the GPU result.
The TensorFlow documentation states here that CUDA kernel launch configurations should be obtained using the tensorflow/core/util/cuda_kernel_helper.h
header, which may be found here.
For what reason would TensorFlow provide me with a launch configuration that does not execute the appropriate number of threads?
I also receive similar results when setting the launch configuration parameters manually.
Upvotes: 1
Views: 268
Reputation: 72349
For what reason would TensorFlow provide me with a launch configuration that does not execute the appropriate number of threads?
I would guess it is because TensorFlow expects that the kernels it runs conform to a design principle which your kernel does not follow. The execution parameters TensorFlow returns limit the thread count to the maximum number of concurrent threads which can theoretically run on a given device. See here for full details.
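As a rough sketch of what that means (this is only a paraphrase of the idea, not the helper's actual source; the function and parameter names below are made up for illustration), the block count ends up being bounded by the hardware rather than by the problem size:

#include <algorithm>

// Hypothetical illustration of how a launch configuration can be capped by the
// device rather than by the element count. Names and the exact formula are
// assumptions, not TensorFlow's verbatim code.
int sketchBlockCount(int work_element_count, int num_multiprocessors,
                     int max_threads_per_sm, int thread_per_block = 1024) {
    // Only as many "physical" threads as the device can keep resident at once.
    const int physical_threads =
        std::min(work_element_count, num_multiprocessors * max_threads_per_sm);
    // Enough blocks to cover those threads, capped at the multiprocessor count.
    return std::min((physical_threads + thread_per_block - 1) / thread_per_block,
                    num_multiprocessors);
}
// e.g. sketchBlockCount(32768, 10, 2048) == 10, i.e. 10 * 1024 = 10240 threads
// for a 32768-element buffer -- which matches the output shown in the question.

So block_count * thread_per_block is a function of the device, not of len, and a kernel that assumes one element per thread will simply never touch the tail of the buffer.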
Your job is to write a kernel which conforms to that design pattern, basically by being capable of processing multiple input data points per thread. In practice this means trivially modifying your kernel to something like this:
template<typename T>
__global__
void blockReduceDevice(const T *buffer, T *out, size_t len) {
    const size_t tIdx = threadIdx.x;
    const size_t bIdx = blockIdx.x;
    const size_t bDim = blockDim.x;
    const size_t stride = gridDim.x * blockDim.x;
    size_t idx = bIdx * bDim + tIdx;
    //To allow templated, dynamic shared memory, we set the
    //smem to be uchar and reinterpret as templated type.
    extern __shared__ __align__(sizeof(T)) unsigned char buffReduce[];
    // cargo cult : __syncthreads();
    T *reduce = reinterpret_cast<T*>(buffReduce);
    // accumulate a thread-local partial sum over a grid-sized stride
    T threadsum = T(0);
    for(; idx < len; idx += stride)
        threadsum += buffer[idx];
    // store thread local partial reduction to shared memory
    reduce[tIdx] = threadsum;
    __syncthreads();
    // etc
[warning: obviously never compiled or run, use at own risk]
Basically, this design has each thread iterate through as many input data points as required to process all of the input data, in a fashion which preserves memory coalescing.
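Putting the two halves together, an untested, self-contained sketch of the complete kernel (the grid-stride accumulation above spliced back into your original shared-memory reduction, with the same caveat as before) would look something like this:

template<typename T>
__global__
void blockReduceDevice(const T *buffer, T *out, size_t len) {
    const size_t tIdx = threadIdx.x;
    const size_t bDim = blockDim.x;
    const size_t stride = gridDim.x * blockDim.x;

    //Templated dynamic shared memory, as in the original kernel.
    extern __shared__ __align__(sizeof(T)) unsigned char buffReduce[];
    T *reduce = reinterpret_cast<T*>(buffReduce);

    //Each thread serially accumulates every element it is responsible for.
    T threadsum = T(0);
    for (size_t idx = blockIdx.x * bDim + tIdx; idx < len; idx += stride)
        threadsum += buffer[idx];

    //Store the per-thread partial sum, then reduce within the block
    //(bDim is assumed to be a power of two, as it is with 1024 threads).
    reduce[tIdx] = threadsum;
    __syncthreads();

    for (size_t i = bDim >> 1; i >= 1; i >>= 1) {
        if (tIdx < i) {
            reduce[tIdx] += reduce[tIdx + i];
        }
        __syncthreads();
    }

    if (tIdx == 0) {
        out[blockIdx.x] = reduce[0];
    }
}

The host side of your test harness should not need to change: each of the block_count partial sums now accounts for every element its threads visited, so the final std::accumulate over blockVals should return 32768.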
Upvotes: 2