Reputation: 24705
I keep seeing similar terms while reading about the memory hierarchy of GPUs, and since there have been architectural changes across past generations, I don't know whether these terms can be used interchangeably or have different meanings. The device is an M2000, which is compute capability 5.2.
The top level (closest to the pipeline) is a unified L1/texture cache, which is 24 KB per SM. Is it unified for instructions and data too?
Below that is the L2 cache, which is also known as shared memory and is shared by all SMs. According to ./deviceQuery, the L2 size is 768 KB. If that is an aggregate value, then each SM has 768 KB / 6 = 128 KB. However, according to the programming guide, shared memory is 96 KB.
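For reference, here is a minimal sketch (my addition, not part of the original question) that prints the same properties deviceQuery reports; it shows that l2CacheSize is a single whole-GPU value, while shared memory is reported separately per SM and per block:

```cpp
// Query the fields that deviceQuery prints for L2 and shared memory.
// Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0, e.g. the M2000

    // L2 cache size is one value for the whole GPU, not per SM.
    printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);

    // Shared memory is a separate, SM-local resource.
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    printf("Multiprocessor count:    %d\n", prop.multiProcessorCount);
    printf("Total global memory:     %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```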
What is constant memory then, and where does it reside? There is no information about its size in either deviceQuery or the nvprof metrics. The programming guide says the following (a small usage sketch follows the quote):
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
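To make the constant memory space more concrete, here is a minimal sketch (my addition, with illustrative names and sizes): constant memory is declared with __constant__ at file scope, written from the host with cudaMemcpyToSymbol, and read by kernels through the constant cache. The documented limit for this device class is 64 KB of constant memory.

```cpp
// Minimal constant-memory usage sketch; error checking omitted.
#include <cuda_runtime.h>

__constant__ float coeffs[16];                    // resides in the constant memory space

__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // A uniform index (same for all threads in a warp) is broadcast by the constant cache.
    if (i < n) out[i] = in[i] * coeffs[blockIdx.x % 16];
}

int main() {
    float h_coeffs[16];
    for (int i = 0; i < 16; ++i) h_coeffs[i] = 1.0f + i;

    // Constant memory is not cudaMalloc'd; it is written to a symbol from the host.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    // ... cudaMalloc the in/out buffers and launch scale<<<grid, block>>>(...) as usual.
    return 0;
}
```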
Below L2 is the global memory, which is known as device memory and can be 2 GB, 4 GB, and so on.
Upvotes: 5
Views: 7456
Reputation: 11529
The NVIDIA GPU architecture has the following access paths. The GPU may have additional caches within the hierarchy presented below.
The NVIDIA CUDA profilers (Nsight Compute, NVIDIA Visual Profiler, and Nsight VSE CUDA Profiler) have high-level diagrams of the memory hierarchy to help you understand how logical requests map to the hardware.
For CC 5.x/6.x there are two unified L1TEX caches per SM. Each L1/TEX unit services one SM partition. Each SM partition has two sub-partitions (2 warp schedulers). The SM contains a separate RAM and data path for shared memory. The L1TEX unit services neither instruction fetches nor constant data loads (via c[bank][offset]); instruction fetches and constant loads are handled through separate cache hierarchies (see above). The CUDA programming model also supports read-only (const) data access through L1TEX via the global memory address space.
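To illustrate the two read paths just mentioned, here is a small sketch of my own (names are illustrative): a __constant__ variable compiles to c[bank][offset] loads that go through the constant cache hierarchy, while const __restrict__ / __ldg() loads from global memory go through the L1TEX read-only path (CC 3.5 and newer):

```cpp
__constant__ float gain;                               // set from the host with cudaMemcpyToSymbol;
                                                       // reads compile to c[bank][offset] loads

__global__ void apply(const float* __restrict__ in,
                      float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) * gain;                 // __ldg routes the global load through
                                                       // the read-only (L1TEX) data path
}
```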
The L2 cache is shared by all engines in the GPU, including but not limited to SMs, copy engines, video decoders, video encoders, and display controllers. The L2 cache is not partitioned by client, and L2 is not referred to as shared memory. In NVIDIA GPUs, shared memory is a RAM local to the SM that supports efficient non-linear access.
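As an illustration of that non-linear access (a sketch I am adding, not part of the original answer), a tiled transpose stages data in shared memory so that both the global read and the global write stay coalesced while the reordering happens entirely in the SM-local RAM:

```cpp
#define TILE 32

// Transpose a width x height matrix; launched with dim3 block(TILE, TILE)
// and a grid covering the matrix. Names and sizes are illustrative.
__global__ void transpose(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];             // +1 pad avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;               // swap block indices for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // non-linear shared read, coalesced global write
}
```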
Global memory is a virtual address space that may include:
- dedicated GPU device memory (the on-board DRAM)
- pinned (page-locked) system memory mapped into the device address space
- peer GPU memory accessed over peer-to-peer
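A short sketch (my addition) of why global memory is described as a virtual address space rather than just the on-board DRAM: a mapped, page-locked host allocation also receives a device pointer that kernels can dereference as global memory:

```cpp
// Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ordinary device memory in the global address space.
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));

    // Page-locked host memory mapped into the device's address space.
    float* h_buf = nullptr;
    float* d_alias = nullptr;
    cudaHostAlloc(&h_buf, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_buf, 0);

    // Both d_buf and d_alias can be dereferenced by a kernel as global memory,
    // even though only d_buf is backed by the GPU's own DRAM.
    printf("device ptr: %p, mapped host ptr (device view): %p\n",
           (void*)d_buf, (void*)d_alias);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```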
Upvotes: 14