user2286339

Reputation: 194

CUDA shared vs global memory, possible speedup

I believe my CUDA application could benefit from shared memory to keep the data close to the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory contains the result, which is copied to host memory. This scheme works perfectly and is cross-checked against the same algorithm run on the CPU.

The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, for the best performance you should coalesce and align memory accesses across the threads of a warp. My GPU has Compute Capability 6.1 ("Pascal"), 48 kiB of shared memory per thread block and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
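
To illustrate the coalescing part, here is a minimal (hypothetical, not from my application) pair of global-memory kernels: consecutive threads touching consecutive addresses combine into a few transactions per warp, while a large stride does not:

__global__ void coalesced(double *g) { g[threadIdx.x] += 1.0; }      // consecutive addresses: few transactions per warp
__global__ void strided(double *g)   { g[threadIdx.x * 32] += 1.0; } // large stride: up to 32 transactions per warp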

Shared memory is organized in 32 banks, so that 32 threads from the same block can each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch it in a configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 kiB / 32 banks / sizeof(double), which equals 192:

// 6144 doubles * 8 bytes = 48 kiB of statically allocated shared memory
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
       // Perform calculation on shared memory
       cache[threadIdx.x * 192] = ...
    }

    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}

My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double makes sure that each thread will access the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code, but is this the right way to align access to shared memory?
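
To sanity-check that arithmetic, here is a small host-side sketch (my own, untested; assumes Pascal's 32 banks of 4 bytes each) that prints the byte offset and the bank that each thread's access cache[tid * 192] would map to:

#include <cstdio>

int main()
{
    const unsigned banks = 32, bank_width = 4;          // 32 banks, 4 bytes wide on CC 6.x

    for (unsigned tid = 0; tid < 32; ++tid)
    {
        size_t byte_off = tid * 192 * sizeof(double);   // byte offset of cache[tid * 192]
        printf("tid %2u -> byte offset %6zu, bank %2u\n",
               tid, byte_off, (unsigned)(byte_off / bank_width % banks));
    }

    return 0;
}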


MWE added: This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds. Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.

#include "cuda_runtime.h"

#include <iostream>
#include <stdio.h>

#define _USE_MATH_DEFINES
#include <math.h>
    

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{   
    double z, y, y_max;

    unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;    // single-block launch, so no block offset needed
    
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;

    double temp = WAVE_NUMBER / SCREEN_DIST;


    // Make sure the per-thread accumulator is zero before we begin
    buf[tid] = 0;

    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            // Accumulate directly into off-chip global memory (the naive part)
            buf[tid] += cos(temp * (Y * y + Z * z));
        }
    }   
}


int main(void)
{
    double *dev_mem;
    double *buf = NULL;
    cudaError_t cudaStatus;

    unsigned int screen_elems = 1000;


    if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
    {
        printf("Could not allocate memory...");
        return -1;
    }

    memset(buf, 0, screen_elems * sizeof(double));


    if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
    {
        printf("cudaMalloc failed with code %u", cudaStatus);
        return cudaStatus;
    }


    kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);

    cudaDeviceSynchronize();

    if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
    {
        printf("cudaMemcpy failed with code %u", cudaStatus);
        return cudaStatus;
    }


    cudaFree(dev_mem);

    cudaDeviceReset();

    free(buf);

    return 0;
}

The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:

// 1000 doubles = 8000 bytes of statically allocated shared memory, one slot per thread
__shared__ double cache[1000];


__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{   
    double z, y, y_max;

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;

    double temp = WAVE_NUMBER / SCREEN_DIST;


    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;

    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }   

    buf[tid] = cache[tid];
} 

The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not. What am I missing?

Upvotes: 0

Views: 2121

Answers (1)

einpoklum

Reputation: 132148

Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.

In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, that is probably what you should do in your "MWE added" kernel.
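
A sketch of what that could look like for your kernel (untested, but it keeps your loop structure; the accumulator acc is mine):

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;

    double acc = 0;    // per-thread accumulator; lives in a register, not in memory

    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
            acc += cos(temp * (Y * y + Z * z));
    }

    buf[tid] = acc;    // a single write to global memory at the very end
}

This avoids both the repeated global-memory traffic of your first version and the (pointless, here) shared-memory traffic of your second version.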

If your threads were to share information, then the pattern of that sharing would determine how best to utilize shared memory; see the sketch below for a common example.
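
For instance, a block-wide tree reduction is the classic case where shared memory pays off, because partial results actually cross between threads (a generic sketch, not specific to your algorithm; assumes a power-of-two block size of 256):

__global__ void block_sum(const double *in, double *out)
{
    __shared__ double s[256];                  // one slot per thread in the block

    unsigned int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Halve the number of active threads each step; data crosses threads here,
    // which is what justifies staging it in shared memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];                // thread 0 writes the block's total
}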

Also remember that shared memory is much less likely to help if you do not read the data repeatedly, or from multiple threads: to get your data into shared memory in the first place, you always have to read it from global memory at least once and write it to shared memory at least once.

Upvotes: 1
