Reputation: 61
I have a question about the book "Professional CUDA C Programming".
It says the following about the GPU cache:
On the CPU, both memory loads and stores can be cached. However, on the GPU only memory load operations can be cached; memory store operations cannot be cached. [p142]
But on another page, it says:
Global memory loads/stores are staged through caches. [p158]
I'm really confused about whether the GPU caches stores or not.
If the first quote is correct, I understand it to mean that the GPU does not cache writes (modifications of data).
Thus, a write goes directly to global memory (DRAM).
Also, is this similar to the CPU's "no-write-allocate" policy?
I want some clear explanation from you guys... Thanks!
Upvotes: 2
Views: 119
Reputation: 3095
Even the ancient Fermi architecture (compute capability 2.x) cached stores in L2 according to its whitepaper (emphasis mine):
Fermi features a 768 KB unified L2 cache that services all load, store, and texture requests.
So the book seems to be talking about write-caching in L1 data cache specifically.
The short answer regarding write-caching in L1 is that since the Volta architecture (compute capability 7.0, newer than anything covered by the book the OP quotes), stores can certainly be cached in L1, according to its whitepaper:
Enhanced L1 Data Cache and Shared Memory
[...] Prior NVIDIA GPUs only performed load caching, while GV100 introduces write-caching (caching of store operations) to further improve performance.
and according to the Turing Tuning Guide (compute capability 7.2 and 7.5):
Like Volta, Turing’s L1 can cache write operations (write-through).
For context: pre-Volta architectures did not even consistently cache global loads in the L1 data cache, i.e. some GPUs always did it, some needed special compilation flags to do it, and some could not do it at all (although one could always use the smaller on-chip constant cache for read-caching instead).
As all architectures since Volta/Turing feature the same on-chip unified data cache architecture and as there is nothing to be found in their tuning guides regarding write-caching in L1, one can safely assume that these newer architectures (Ampere, Ada, Hopper and Blackwell) also do global memory write-caching in L1.
For a deeper dive, take a look at the PTX ISA's cache operators (also available as CUDA C++ intrinsics called store functions using cache hints):
.wb: Cache write-back all coherent levels. The default store instruction cache operation is st.wb, which writes back cache lines of coherent cache levels with normal eviction policy. [...]
.cg: Cache at global level (cache in L2 and below, not L1). Use st.cg to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache.
.cs: Cache streaming, likely to be accessed once. The st.cs store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data.
.wt: Cache write-through (to system memory). The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache.
This table is written confusingly (maybe to avoid describing architecture-specific behavior), but given the information we already have about L1 write-caching, the best interpretation I can come up with is that .wb and .wt concern how the write is handled by L2, while leaving L1 write-caching up to the particular architecture, as L1 is not a coherent level and probably does not contain the necessary logic to implement write-back. As the description for .wb does not concern the handling in non-coherent levels at all, this is fine.
One can think of the L1 write-caching as always being write-through (i.e. eagerly writing to the next level), but with invalidation of the L1 cache line on pre-Volta architectures, which is not what one typically thinks of as "write-through" but should still be fine.
.cg explicitly disallows caching in L1, i.e. it should always reproduce the behavior of pre-Volta architectures. And .cs does not mention the cache levels at all and just determines the eviction policy. This interpretation agrees with the one given at Making better sense of the PTX store caching modes (assuming a Volta or later architecture).
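For completeness, here is a minimal sketch (my own illustration, with made-up kernel and buffer names) of the CUDA C++ "store functions using cache hints" that map onto these PTX cache operators; they require a device of compute capability 3.2 or higher:

```cuda
__global__ void store_with_hints(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = 2.0f * in[i];

    // st.wb (default): write back through all coherent cache levels,
    // equivalent to a plain "out[i] = v;".
    __stwb(&out[i], v);

    // The alternatives below map to the other cache operators from the table:
    // __stcg(&out[i], v);  // st.cg: bypass L1, cache only in L2
    // __stcs(&out[i], v);  // st.cs: streaming store, evict-first policy
    // __stwt(&out[i], v);  // st.wt: write-through (e.g. for system memory)
}
```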
So stores are always cached in L2, while L1 write-caching depends on the GPU architecture and the actual store instruction used.
Upvotes: 3
Reputation: 76724
Professional CUDA C Programming is a book from 2014.
Back then, Maxwell (compute capability 5.x) had just come out, and the book does not describe Maxwell, only the older Kepler.
We are now at compute capability 10.x, five generations later, and things have changed a bit.
The GPU has four levels of memory, and the different kinds of memory on the GPU map onto this hierarchy in different ways.
Let me break that down for writes first (for reads, see further down); a short kernel sketch after this breakdown illustrates the write paths.
Shared memory shares its on-chip storage with the L1 cache and is local to a multiprocessor; each block running on it gets its own allocation. All writes to shared memory are only visible to threads in the same block. Shared memory is of fixed size and only explicit reads/writes happen to it, so it never overflows and no cache evictions happen (even though hardware-wise it is part of the L1 cache).
Local memory is NVIDIA's version of the stack. It is backed by global memory, but cached by the L1 cache. Because it is private to each thread, writes to local memory do not go through to L2/global memory unless the L1 overflows.
Global memory is visible to all threads/blocks on the GPU. Any write to global memory must be visible to all blocks, so writes go to the L2 cache and are only written out to device memory (DRAM) if the L2 overflows (i.e. write-back).
Pinned memory is visible to both the CPU and the GPU. Writes to pinned memory are cached in L2, but are write-through, because they must be visible to the CPU, which does not have access to the GPU's L2. These writes do not end up in GPU device memory but in CPU memory; the GPU has access to that memory via the PCIe bus.
Constant memory can only be written by the CPU. It lives in global memory, but reads from it are cached by a dedicated L2/L1 constant cache in separate silicon from the normal L2/L1 caches.
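As promised above, here is a minimal kernel sketch of the write paths (kernel and buffer names are my own, and the block size is assumed to be 256 threads):

```cuda
__global__ void write_paths(float *global_out, const float *global_in, int n)
{
    __shared__ float tile[256];             // on-chip, per-block; never written to DRAM

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = global_in[i];   // shared-memory write: visible only within this block
    __syncthreads();                        // make the shared writes visible to the whole block

    if (i < n) {
        float tmp = tile[threadIdx.x];      // thread-private values live in registers and only
                                            // spill to (L1-cached) local memory under pressure
        global_out[i] = 2.0f * tmp;         // global write: staged through L2, written to DRAM on eviction
    }
}
```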
Now with regard to reads:
Shared memory is read from the L1 cache dedicated to shared memory. It never comes from anywhere else.
Local memory. If the L1 cache holds the requested item, it will be read from the L1 cache, because these items are private to each thread. They are only read from the L2 cache/global memory if the L1 has evicted the requested item.
Global memory. These items are read from the L1 cache, unless it is an atomic access; any write to that global memory item (by another block) will invalidate the L1 cache line, and it will be read from the L2 cache instead. Only if the L2 cache has evicted that item (due to running out of space) will it be read from device memory (DRAM).
Pinned memory. I'm not sure, I'll have to look it up.
Constant memory. Because the GPU cannot alter constant memory, it is always read from a separate constant L1 cache first; if not present there, from the L2 cache; and lastly from device memory. Repeated reads of an item will add the item to the constant L1 cache, evicting others. Every multiprocessor has its own (very small) L1 constant cache.
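A minimal sketch of constant memory (names are my own): it is written by the host with cudaMemcpyToSymbol and read by the kernel through the dedicated constant cache:

```cuda
#include <cuda_runtime.h>

// Constant memory: written only by the host, read by kernels through the
// per-multiprocessor constant cache (separate from the normal L1/L2 data caches).
__constant__ float coeffs[2];

__global__ void apply_coeffs(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[0] * in[i] + coeffs[1];  // served by the constant cache after the first read
}

// Host side, before launching the kernel:
// float h_coeffs[2] = {2.0f, 1.0f};
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```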
TL;DR: Reads and writes are definitely cached.
References:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
Note: there is also extended memory for multi-gpu systems, see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#extended-gpu-memory
Upvotes: 0