user2286339

Reputation: 194

CUDA shared vs global memory, possible speedup

I believe my CUDA application could benefit from shared memory to keep the data close to the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory contains the result, which is copied to host memory. This scheme works perfectly and is cross-checked against the same algorithm run on the CPU.

The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, for the best performance you should coalesce and align memory accesses across the threads of a warp. My GPU has Compute Capability 6.1 ("Pascal"), 48 kiB of shared memory per thread block and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
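
To illustrate the coalescing part, here is a minimal (hypothetical, not from my application) pair of global-memory kernels: consecutive threads touching consecutive addresses combine into a few transactions per warp, while a large stride does not:

__global__ void coalesced(double *g) { g[threadIdx.x] += 1.0; }      // consecutive addresses: few transactions per warp
__global__ void strided(double *g)   { g[threadIdx.x * 32] += 1.0; } // large stride: up to 32 transactions per warp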

Shared memory is organized in 32 banks, so that 32 threads from the same block can each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch it in a configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 kiB / 32 banks / sizeof(double), which equals 192:

// 6144 doubles * 8 bytes = 48 kiB of statically allocated shared memory
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
       // Perform calculation on shared memory
       cache[threadIdx.x * 192] = ...
    }

    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}

My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double makes sure that each thread will access the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code, but is this the right way to align access to shared memory?
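
To sanity-check that arithmetic, here is a small host-side sketch (my own, untested; assumes Pascal's 32 banks of 4 bytes each) that prints the byte offset and the bank that each thread's access cache[tid * 192] would map to:

#include <cstdio>

int main()
{
    const unsigned banks = 32, bank_width = 4;          // 32 banks, 4 bytes wide on CC 6.x

    for (unsigned tid = 0; tid < 32; ++tid)
    {
        size_t byte_off = tid * 192 * sizeof(double);   // byte offset of cache[tid * 192]
        printf("tid %2u -> byte offset %6zu, bank %2u\n",
               tid, byte_off, (unsigned)(byte_off / bank_width % banks));
    }

    return 0;
}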


MWE added: This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds. Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.

#include "cuda_runtime.h"

#include <iostream>
#include <stdio.h>

#define _USE_MATH_DEFINES
#include <math.h>
    

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{   
    double z, y, y_max;

    unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;    // single-block launch, so no block offset needed
    
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;

    double temp = WAVE_NUMBER / SCREEN_DIST;


    // Make sure the per-thread accumulator is zero before we begin
    buf[tid] = 0;

    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            // Accumulate directly into off-chip global memory (the naive part)
            buf[tid] += cos(temp * (Y * y + Z * z));
        }
    }   
}


int main(void)
{
    double *dev_mem;
    double *buf = NULL;
    cudaError_t cudaStatus;

    unsigned int screen_elems = 1000;


    if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
    {
        printf("Could not allocate memory...");
        return -1;
    }

    memset(buf, 0, screen_elems * sizeof(double));


    if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
    {
        printf("cudaMalloc failed with code %u", cudaStatus);
        return cudaStatus;
    }


    kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);

    cudaDeviceSynchronize();

    if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
    {
        printf("cudaMemcpy failed with code %u", cudaStatus);
        return cudaStatus;
    }


    cudaFree(dev_mem);

    cudaDeviceReset();

    free(buf);

    return 0;
}

The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:

// 1000 doubles = 8000 bytes of statically allocated shared memory, one slot per thread
__shared__ double cache[1000];


__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{   
    double z, y, y_max;

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;

    double temp = WAVE_NUMBER / SCREEN_DIST;


    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;

    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }   

    buf[tid] = cache[tid];
} 

The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not. What am I missing?

Upvotes: 0

Views: 2121

Answers (1)

einpoklum

Reputation: 132148

Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.

In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, that is probably what you should do in your "MWE added" kernel.
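
A sketch of what that could look like for your kernel (untested, but it keeps your loop structure; the accumulator acc is mine):

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
    double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;

    double acc = 0;    // per-thread accumulator; lives in a register, not in memory

    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);

        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
            acc += cos(temp * (Y * y + Z * z));
    }

    buf[tid] = acc;    // a single write to global memory at the very end
}

This avoids both the repeated global-memory traffic of your first version and the (pointless, here) shared-memory traffic of your second version.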

If your threads were to share information, then the pattern of that sharing would determine how best to utilize shared memory; see the sketch below for a common example.
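
For instance, a block-wide tree reduction is the classic case where shared memory pays off, because partial results actually cross between threads (a generic sketch, not specific to your algorithm; assumes a power-of-two block size of 256):

__global__ void block_sum(const double *in, double *out)
{
    __shared__ double s[256];                  // one slot per thread in the block

    unsigned int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Halve the number of active threads each step; data crosses threads here,
    // which is what justifies staging it in shared memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];                // thread 0 writes the block's total
}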

Also remember that shared memory is much less likely to help if you do not read the data repeatedly, or from multiple threads: to get your data into shared memory in the first place, you always have to read it from global memory at least once and write it to shared memory at least once.

Upvotes: 1
