mahmood

Reputation: 24705

L1 cache in GPU

I see some similar terms while reading about the memory hierarchy of GPUs, and since there have been architectural changes across generations, I don't know whether these terms can be used interchangeably or have different meanings. The device is an M2000, which is compute capability 5.2.

The top level (closest to the pipeline) is a unified L1/texture cache, which is 24KB per SM. Is it unified for instructions and data too?

Below that is the L2 cache, which is also known as shared memory and is shared by all SMs. According to ./deviceQuery, the L2 size is 768KB. If that is an aggregate value, then each SM has 768KB/6 = 128KB. However, according to the programming guide, shared memory is 96KB.

What is constant memory then, and where does it reside? There is no information about its size in either deviceQuery or the nvprof metrics. The programming guide says:

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

Below L2 is the global memory, which is known as device memory and can be 2GB, 4GB, and so on.
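For completeness, here is a minimal sketch that reads these sizes at runtime via cudaGetDeviceProperties. Device 0 is assumed, and the fields are those of the CUDA runtime's cudaDeviceProp struct:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

        printf("SMs:                  %d\n",   prop.multiProcessorCount);
        printf("L2 cache (aggregate): %d KB\n",  prop.l2CacheSize / 1024);
        printf("Shared mem per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("Constant memory:      %zu KB\n", prop.totalConstMem / 1024);
        printf("Global memory:        %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
        return 0;
    }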

Upvotes: 5

Views: 7456

Answers (1)

Greg Smith

Reputation: 11529

NVIDIA GPU architectures have the following access paths. A GPU may have additional caches within the hierarchy presented below; a sketch of the two constant paths follows the list.

  • Path for Global, Local Memory
    • (CC3.*) L1 -> L2
    • (CC5.*-6.*) L1TEX -> L2
    • (CC7.*) L1TEX (LSU) -> L2
  • Path for Surface, Texture
    • (CC < 5) TEX
    • (CC5.*-6.*) L1TEX -> L2
    • (CC7.*) L1TEX (TEX) -> L2
  • Path for Shared
    • (CC3.*) L1
    • (CC5.*-6.*) SharedMemory
    • (CC7.*) L1TEX (LSU)
  • Path for Immediate Constant
    • ... c[bank][offset] -> IMC - Immediate Constant Cache -> L2 Cache
  • Path for Indexed Constant
    • LDC Rd, c[bank][offset] -> IDC - Indexed Constant Cache -> L2 Cache
  • Path for Instruction
    • ICC - Instruction Cache -> L2
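To make the two constant paths concrete, here is a minimal sketch. Whether the compiler encodes a given access as an immediate c[bank][offset] operand or as an explicit LDC depends on the compiler version and the access pattern; the kernel and names below are illustrative, not definitive:

    __constant__ float coeff[64];   // resides in the constant address space

    __global__ void constant_paths(float *out, const int *idx) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Uniform access at a compile-time offset: eligible for the immediate
        // c[bank][offset] encoding (IMC path).
        float a = coeff[0];
        // Per-thread indexed access: compiled to an LDC load that goes through
        // the indexed-constant cache (IDC path).
        float b = coeff[idx[i] & 63];
        out[i] = a + b;
    }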

The NVIDIA CUDA profilers (Nsight Compute, NVIDIA Visual Profiler, and Nsight VSE CUDA Profiler) have high-level diagrams of the memory hierarchy to help you understand how logical requests map to the hardware.

[Figure: CC3.* memory hierarchy diagram]

For CC5.*/6.* there are two unified L1TEX caches per SM. Each L1/TEX unit services one SM partition. Each SM partition has two sub-partitions (2 warp schedulers). The SM contains a separate RAM and data path for shared memory. The L1TEX unit services neither instruction fetches nor constant data loads (via c[bank][offset]); those are handled through separate cache hierarchies (see above). The CUDA programming model also supports read-only (const) data access through L1TEX via the global memory address space.
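As a sketch of that last point, a global load can be steered toward the read-only path by marking the pointer const __restrict__, or requested explicitly with the __ldg() intrinsic (available on CC 3.5 and later); whether the compiler actually routes a given load that way is its decision:

    __global__ void scale(float *out, const float * __restrict__ in, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // const + __restrict__ lets the compiler route this global load
            // through the read-only (L1TEX) data path; __ldg(&in[i]) would
            // request the same path explicitly.
            out[i] = k * in[i];
        }
    }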

The L2 cache is shared by all engines in the GPU, including but not limited to SMs, copy engines, video decoders, video encoders, and display controllers. The L2 cache is not partitioned by client. L2 is not referred to as shared memory. In NVIDIA GPUs, shared memory is a RAM local to the SM that supports efficient non-linear access.
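A standard illustration of that non-linear access is a tiled matrix transpose, where each tile is written row-wise and read back column-wise out of shared memory. A minimal sketch (the tile size and padding are conventional choices, not requirements):

    #define TILE 32

    // Launch with blockDim = (TILE, TILE); in is w x h, out is h x w.
    __global__ void transpose(float *out, const float *in, int w, int h) {
        __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < w && y < h)
            tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // row-wise write
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;                  // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < h && y < w)
            out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read
    }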

Global memory is a virtual address space that may include (see the sketch after this list):

  • dedicated memory attached to the GPU, referred to as device memory, video memory, or frame buffer depending on the context.
  • pinned system memory
  • non-pinned system memory through unified virtual memory
  • peer memory
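A minimal sketch of how several of these backings surface as ordinary global-memory pointers (CUDA runtime APIs; error checking omitted, and peer memory is left out since it needs a second device):

    #include <cuda_runtime.h>

    int main() {
        float *dev = 0, *pinned = 0, *managed = 0;
        size_t bytes = 1 << 20;

        cudaMalloc((void **)&dev, bytes);            // dedicated device memory (frame buffer)
        cudaMallocHost((void **)&pinned, bytes);     // pinned system memory
        cudaMallocManaged((void **)&managed, bytes); // unified (managed) memory

        // Under unified virtual addressing, all three pointers are ordinary
        // global-memory addresses from a kernel's point of view; only the
        // physical memory behind them differs.

        cudaFree(dev);
        cudaFreeHost(pinned);
        cudaFree(managed);
        return 0;
    }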

Upvotes: 14
