Reputation: 95
Could you please explain the differences between using "16 KB shared memory + 48 KB L1 cache" versus "48 KB shared memory + 16 KB L1 cache" in CUDA programming? What should I expect in terms of execution time? When could I expect a lower GPU runtime?
Upvotes: 1
Views: 1856
Reputation: 21108
On Fermi and Kepler NVIDIA GPUs, each SM has a 64 KB block of on-chip memory which can be configured as either 16 KB shared memory + 48 KB L1 cache or 48 KB shared memory + 16 KB L1 cache. Which mode you use depends on how much use of shared memory your kernel makes. If your kernel uses a lot of shared memory, then you would probably find that configuring it as 48 KB shared memory allows higher occupancy and hence better performance.
On the other hand, if your kernel does not use shared memory at all, or only uses a very small amount per block, then you would configure it as 48 KB L1 cache.
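As a concrete illustration, the preference can be set per kernel with `cudaFuncSetCacheConfig()` or device-wide with `cudaDeviceSetCacheConfig()`. The kernel below is a hypothetical example that uses no shared memory, so it is a candidate for the larger L1 cache; on architectures without a configurable split the call is simply a no-op hint.

```cuda
#include <cstdio>

// Hypothetical kernel using no shared memory: a candidate for
// the 16 KB shared / 48 KB L1 configuration.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    // Ask the driver to prefer 48 KB L1 cache for this kernel only.
    cudaFuncSetCacheConfig(saxpy, cudaFuncCachePreferL1);

    // Alternatively, set a device-wide default; a kernel that needs
    // lots of shared memory would instead prefer the 48 KB shared split:
    // cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note that the configuration is a preference, not a guarantee: the driver may override it if a kernel requires more shared memory than the requested split provides.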
How much counts as a "very small amount" is probably best illustrated with the occupancy calculator, a spreadsheet included with the CUDA Toolkit. It allows you to investigate the effect of different amounts of shared memory per block and different block sizes on occupancy.
The occupancy calculator spreadsheet is no longer distributed with recent versions of the CUDA Toolkit. Its functionality has been integrated into the Nsight Compute profiler, which is also part of the CUDA Toolkit.
Upvotes: 5