Reputation: 453
Modern GPU architectures have both an L1 cache and an L2 cache. It is well known that L1 cache is much faster than global memory, but the speed of L2 cache is less clear in the CUDA documentation. The only figures I could find are that a global memory operation has a latency of about 300-500 cycles, while an L1 cache operation takes only about 30 cycles. Can anyone give the speed of L2 cache? Such information would be very useful: if L2 cache is not much faster than global memory, there is little point in optimizing for it. If the speed differs across architectures, I would like to focus on the latest ones, such as the NVIDIA GeForce RTX 3090 (Compute Capability 8.6) or the NVIDIA Tesla V100 (Compute Capability 7.0).
Thank you!
Upvotes: 4
Views: 4086
Reputation: 151869
There are at least 2 figures of merit commonly used when discussing GPU memory: latency and bandwidth. From a latency perspective, this number is not published by NVIDIA (that I know of) and the usual practice is to discover it with careful microbenchmarking.
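The usual microbenchmarking technique for latency is a dependent pointer chase: each load's address depends on the previous load's result, so loads cannot overlap and the average cycles per load approximates the latency of whichever level of the hierarchy holds the data. Here is a minimal sketch of that idea (not careful benchmarking: the 4MB footprint, the 128-byte stride, and timing with `clock64()` from a single thread are all choices I'm assuming; sizing the footprint under or over the L2 size selects whether you measure L2 or DRAM latency):

```cuda
#include <cstdio>

// Single-thread dependent pointer chase: serialized loads, so
// elapsed cycles / steps approximates per-load latency.
__global__ void chase(const int * __restrict__ idx, int steps, long long *out){
  int j = 0;
  long long start = clock64();
  for (int i = 0; i < steps; i++)
    j = idx[j];                  // each load depends on the previous one
  long long stop = clock64();
  out[0] = stop - start;
  out[1] = j;                    // sink: keeps the chase from being optimized away
}

int main(){
  const int n = 1<<20;           // 4MB of indices: fits in V100's 6MB L2
  const int steps = 1<<20;
  int *idx; long long *out;
  cudaMallocManaged(&idx, n*sizeof(int));
  cudaMallocManaged(&out, 2*sizeof(long long));
  // stride of 32 ints = 128 bytes, i.e. one new cache line per hop
  for (int i = 0; i < n; i++) idx[i] = (i+32) % n;
  chase<<<1,1>>>(idx, steps, out);
  cudaDeviceSynchronize();
  printf("avg cycles per load: %lld\n", out[0]/steps);
}
```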
From a bandwidth perspective, AFAIK this number is also not published by NVIDIA (for L2 cache), but it should be fairly easy to discover with a simple copy-kernel test case. We can estimate the bandwidth of global memory by ensuring that our copy kernel uses a copy footprint much larger than the published L2 cache size (6MB for V100), and we can estimate the bandwidth of L2 by keeping the copy footprint smaller than that.
Such a code (IMO) is fairly trivial to write:
$ cat t44.cu
template <typename T>
__global__ void k(volatile T * __restrict__ d1, volatile T * __restrict__ d2, const int loops, const int ds){
  // volatile loads/stores bypass the L1 cache, so traffic is serviced by L2 or DRAM
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if (i&1) d1[j] = d2[j];
      else d2[j] = d1[j];
}

const int dsize = 1048576*128;
const int iter = 64;

int main(){
  int *d;
  cudaMalloc(&d, dsize);
  // case 1: 32MB copy, should exceed L2 cache on V100
  int csize = 1048576*8;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  // case 2: 2MB copy, should fit in L2 cache on V100
  csize = 1048576/2;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  cudaDeviceSynchronize();
}
$ nvcc -o t44 t44.cu
$ nvprof ./t44
==53310== NVPROF is profiling process 53310, command: ./t44
==53310== Profiling application: ./t44
==53310== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.9032ms 2 3.4516ms 123.39us 6.7798ms void k<int>(int volatile *, int volatile *, int, int)
API calls: 89.47% 263.86ms 1 263.86ms 263.86ms 263.86ms cudaMalloc
4.45% 13.111ms 8 1.6388ms 942.75us 2.2322ms cuDeviceTotalMem
3.37% 9.9523ms 808 12.317us 186ns 725.86us cuDeviceGetAttribute
2.34% 6.9006ms 1 6.9006ms 6.9006ms 6.9006ms cudaDeviceSynchronize
0.33% 985.49us 8 123.19us 85.864us 180.73us cuDeviceGetName
0.01% 42.668us 8 5.3330us 1.8710us 22.553us cuDeviceGetPCIBusId
0.01% 34.281us 2 17.140us 6.2880us 27.993us cudaLaunchKernel
0.00% 8.0290us 16 501ns 256ns 1.7980us cuDeviceGet
0.00% 3.4000us 8 425ns 217ns 876ns cuDeviceGetUuid
0.00% 3.3970us 3 1.1320us 652ns 2.0020us cuDeviceGetCount
$
Based on the profiler output, we can estimate global memory bandwidth as:
2*64*32MB/6.78ms = 604GB/s
We can estimate the L2 bandwidth as:
2*64*2MB/123us = 2.08TB/s
Both of these are rough measurements (I'm not doing careful benchmarking here), but bandwidthTest on this V100 GPU reports a device memory bandwidth of ~700GB/s, so I believe the 600GB/s number is "in the ballpark". Taking the L2 cache measurement to be similarly rough, we might guess that L2 cache can be ~3-4x faster than global memory in some circumstances.
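As an aside, the 6MB L2 size used to choose the footprints above need not be hard-coded; the runtime can report it via `cudaDeviceGetAttribute` with `cudaDevAttrL2CacheSize` (a small host-side sketch):

```cuda
#include <cstdio>

int main(){
  int dev = 0, l2size = 0;
  cudaGetDevice(&dev);
  // reports the L2 cache size in bytes (e.g. 6MB on a V100)
  cudaDeviceGetAttribute(&l2size, cudaDevAttrL2CacheSize, dev);
  printf("L2 cache: %d bytes\n", l2size);
}
```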
Upvotes: 10