Reputation: 6149
From what I understand, the variable shared_mem_size in the code below is the amount of shared memory to be allocated for each block. But what happens when the total shared memory size (BLOCKS_GRID * shared_mem_size) exceeds the overall shared memory capacity of the hardware?
#define BLOCKS_GRID   1024   // number of blocks in the grid
#define THREADS_BLOCK 256    // threads per block
typedef float T;             // element type

__global__ void KernelFunction()
{
    extern __shared__ T s_data[];  // dynamic shared memory, sized at launch
}

int main()
{
    size_t shared_mem_size = THREADS_BLOCK * sizeof(T);
    KernelFunction<<<BLOCKS_GRID, THREADS_BLOCK, shared_mem_size>>>();
    return 0;
}
Upvotes: 0
Views: 294
Reputation: 151879
In many CUDA programs, not all of the blocks are executing at the same time. As long as the requested shared memory for a single block does not exceed what is available on the SM (either 16KB or 48KB with current hardware), then at least one block can execute on that SM.
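If you're unsure what that per-block limit is on your GPU, you can query it at runtime. A minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    // Maximum shared memory a single block may use, and the SM count
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Multiprocessor count:    %d\n", prop.multiProcessorCount);
    return 0;
}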
Let's say I have a grid of 1024 blocks, and each block needs 32KB of shared memory. In that case one threadblock can be resident (i.e. executing) on each SM. If I have a machine with 4 SMs, then 4 threadblocks will be executing at any given time. As those threadblocks finish, more are launched, until all 1024 threadblocks are consumed.
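Rather than doing that arithmetic by hand, you can ask the runtime how many blocks of a given configuration fit on one SM. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel, the block size of 256 threads, and the 32KB request are just placeholder numbers matching the example above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelFunction()
{
    extern __shared__ float s_data[];  // dynamic shared memory
}

int main()
{
    int blocksPerSM = 0;
    size_t shared_mem_size = 32 * 1024;  // 32KB per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, KernelFunction,
                                                  256, shared_mem_size);
    printf("Resident blocks per SM: %d\n", blocksPerSM);  // 1 on a 48KB SM
    return 0;
}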
If a single threadblock exceeds the available shared memory (e.g. I request 32KB but the SM is configured to offer only 16KB), then the kernel launch fails with a CUDA API error.
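You can observe that failure by checking the error status immediately after the launch. A minimal sketch that deliberately over-requests dynamic shared memory; the exact error code returned can vary by hardware and driver:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelFunction()
{
    extern __shared__ float s_data[];
}

int main()
{
    size_t too_much = 1024 * 1024;  // 1MB: more than the SM offers
    KernelFunction<<<1, 256, too_much>>>();
    cudaError_t err = cudaGetLastError();  // reports launch/configuration errors
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}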
Upvotes: 4