Reputation: 6149
From what I understand, the variable shared_mem_size in the code below is the amount of shared memory to be allocated for each block. But what happens when the total shared memory size (BLOCKS_GRID * shared_mem_size) exceeds the overall shared memory capacity of the hardware?
#define BLOCKS_GRID   1024   // number of blocks in the grid
#define THREADS_BLOCK 256    // threads per block
typedef float T;             // element type

__global__ void KernelFunction()
{
    extern __shared__ T s_data[];  // dynamic shared memory, sized at launch
}

int main()
{
    size_t shared_mem_size = THREADS_BLOCK * sizeof(T);
    KernelFunction<<<BLOCKS_GRID, THREADS_BLOCK, shared_mem_size>>>();
    return 0;
}
Upvotes: 0
Views: 294
Reputation: 151879
In many CUDA programs, not all of the blocks are executing at the same time. As long as the requested shared memory for a single block does not exceed what is available on the SM (either 16KB or 48KB with current hardware), then at least one block can execute on that SM.
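If you're unsure what that per-block limit is on your GPU, you can query it at runtime. A minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    // Maximum shared memory a single block may use, and the SM count
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Multiprocessor count:    %d\n", prop.multiProcessorCount);
    return 0;
}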
Let's say I have a grid of 1024 blocks, and each block needs 32KB of shared memory. In that case one threadblock can be resident (i.e. executing) on each SM. If I have a machine with 4 SMs, then 4 threadblocks will be executing at any given time. As those threadblocks finish, more are launched, until all 1024 threadblocks are consumed.
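Rather than doing that arithmetic by hand, you can ask the runtime how many blocks of a given configuration fit on one SM. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel, the block size of 256 threads, and the 32KB request are just placeholder numbers matching the example above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelFunction()
{
    extern __shared__ float s_data[];  // dynamic shared memory
}

int main()
{
    int blocksPerSM = 0;
    size_t shared_mem_size = 32 * 1024;  // 32KB per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, KernelFunction,
                                                  256, shared_mem_size);
    printf("Resident blocks per SM: %d\n", blocksPerSM);  // 1 on a 48KB SM
    return 0;
}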
If a single threadblock exceeds the available shared memory (e.g. I request 32KB but the SM is configured to offer only 16KB), then the kernel launch fails with a CUDA API error.
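You can observe that failure by checking the error status immediately after the launch. A minimal sketch that deliberately over-requests dynamic shared memory; the exact error code returned can vary by hardware and driver:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelFunction()
{
    extern __shared__ float s_data[];
}

int main()
{
    size_t too_much = 1024 * 1024;  // 1MB: more than the SM offers
    KernelFunction<<<1, 256, too_much>>>();
    cudaError_t err = cudaGetLastError();  // reports launch/configuration errors
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}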
Upvotes: 4