Reputation: 453
Modern GPU architectures have both an L1 cache and an L2 cache. It is well known that L1 cache is much faster than global memory, but the speed of L2 cache is less clear in the CUDA documentation. The only figures I could find are that a global memory operation has a latency of about 300-500 cycles, while an L1 cache operation takes only about 30 cycles. Can anyone give the speed of L2 cache? Such information would be very useful: if L2 cache is not much faster than global memory, there is little point in optimizing for it. If the speed differs across architectures, I would like to focus on the latest ones, such as the NVIDIA GeForce RTX 3090 (Compute Capability 8.6) or the NVIDIA Tesla V100 (Compute Capability 7.0).
Thank you!
Upvotes: 4
Views: 4086
Reputation: 151869
There are at least 2 figures of merit commonly used when discussing GPU memory: latency and bandwidth. From a latency perspective, this number is not published by NVIDIA (that I know of) and the usual practice is to discover it with careful microbenchmarking.
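The usual microbenchmarking technique for latency is a dependent pointer chase: each load's address depends on the previous load's result, so loads cannot overlap and the average cycles per load approximates the latency of whichever level of the hierarchy holds the data. Here is a minimal sketch of that idea (not careful benchmarking: the 4MB footprint, the 128-byte stride, and timing with `clock64()` from a single thread are all choices I'm assuming; sizing the footprint under or over the L2 size selects whether you measure L2 or DRAM latency):

```cuda
#include <cstdio>

// Single-thread dependent pointer chase: serialized loads, so
// elapsed cycles / steps approximates per-load latency.
__global__ void chase(const int * __restrict__ idx, int steps, long long *out){
  int j = 0;
  long long start = clock64();
  for (int i = 0; i < steps; i++)
    j = idx[j];                  // each load depends on the previous one
  long long stop = clock64();
  out[0] = stop - start;
  out[1] = j;                    // sink: keeps the chase from being optimized away
}

int main(){
  const int n = 1<<20;           // 4MB of indices: fits in V100's 6MB L2
  const int steps = 1<<20;
  int *idx; long long *out;
  cudaMallocManaged(&idx, n*sizeof(int));
  cudaMallocManaged(&out, 2*sizeof(long long));
  // stride of 32 ints = 128 bytes, i.e. one new cache line per hop
  for (int i = 0; i < n; i++) idx[i] = (i+32) % n;
  chase<<<1,1>>>(idx, steps, out);
  cudaDeviceSynchronize();
  printf("avg cycles per load: %lld\n", out[0]/steps);
}
```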
From a bandwidth perspective, AFAIK this number is also not published by NVIDIA (for L2 cache), but it should be fairly easy to discover with a simple copy-kernel test case. We can estimate the bandwidth of global memory by ensuring that our copy kernel uses a copy footprint much larger than the published L2 cache size (6MB for V100), and we can estimate the bandwidth of L2 by keeping the copy footprint smaller than that.
Such a code (IMO) is fairly trivial to write:
$ cat t44.cu
template <typename T>
__global__ void k(volatile T * __restrict__ d1, volatile T * __restrict__ d2, const int loops, const int ds){
  // volatile loads/stores bypass the L1 cache, so traffic is serviced by L2 or DRAM
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if (i&1) d1[j] = d2[j];
      else d2[j] = d1[j];
}

const int dsize = 1048576*128;
const int iter = 64;

int main(){
  int *d;
  cudaMalloc(&d, dsize);
  // case 1: 32MB copy, should exceed L2 cache on V100
  int csize = 1048576*8;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  // case 2: 2MB copy, should fit in L2 cache on V100
  csize = 1048576/2;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  cudaDeviceSynchronize();
}
$ nvcc -o t44 t44.cu
$ nvprof ./t44
==53310== NVPROF is profiling process 53310, command: ./t44
==53310== Profiling application: ./t44
==53310== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.9032ms 2 3.4516ms 123.39us 6.7798ms void k<int>(int volatile *, int volatile *, int, int)
API calls: 89.47% 263.86ms 1 263.86ms 263.86ms 263.86ms cudaMalloc
4.45% 13.111ms 8 1.6388ms 942.75us 2.2322ms cuDeviceTotalMem
3.37% 9.9523ms 808 12.317us 186ns 725.86us cuDeviceGetAttribute
2.34% 6.9006ms 1 6.9006ms 6.9006ms 6.9006ms cudaDeviceSynchronize
0.33% 985.49us 8 123.19us 85.864us 180.73us cuDeviceGetName
0.01% 42.668us 8 5.3330us 1.8710us 22.553us cuDeviceGetPCIBusId
0.01% 34.281us 2 17.140us 6.2880us 27.993us cudaLaunchKernel
0.00% 8.0290us 16 501ns 256ns 1.7980us cuDeviceGet
0.00% 3.4000us 8 425ns 217ns 876ns cuDeviceGetUuid
0.00% 3.3970us 3 1.1320us 652ns 2.0020us cuDeviceGetCount
$
Based on the profiler output, we can estimate global memory bandwidth as:
2*64*32MB/6.78ms = 604GB/s
We can estimate the L2 bandwidth as:
2*64*2MB/123us = 2.08TB/s
Both of these are rough measurements (I'm not doing careful benchmarking here), but bandwidthTest on this V100 GPU reports a device memory bandwidth of ~700GB/s, so I believe the 600GB/s number is "in the ballpark". Taking the L2 cache measurement to be similarly rough, we might guess that L2 cache can be ~3-4x faster than global memory in some circumstances.
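As an aside, the 6MB L2 size used to choose the footprints above need not be hard-coded; the runtime can report it via `cudaDeviceGetAttribute` with `cudaDevAttrL2CacheSize` (a small host-side sketch):

```cuda
#include <cstdio>

int main(){
  int dev = 0, l2size = 0;
  cudaGetDevice(&dev);
  // reports the L2 cache size in bytes (e.g. 6MB on a V100)
  cudaDeviceGetAttribute(&l2size, cudaDevAttrL2CacheSize, dev);
  printf("L2 cache: %d bytes\n", l2size);
}
```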
Upvotes: 10