UserUndefined

Reputation: 23

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster when reading naively by columns rather than by rows. (The example is from Professional CUDA C Programming.) The matrix is stored in memory row by row, i.e. (0,0),(0,1),(0,2)...(1,0),(1,1),(1,2)...

__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) {
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < nx && iy < ny) {
        out[iy*nx + ix] = in[ix*ny + iy]; // read by column, store by row (this kernel)
        // out[ix*ny + iy] = in[iy*nx + ix]; // read by row, store by column (transposeNaiveRow)
    }
}
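For context, a host-side launch would look roughly like the sketch below. The sizes are illustrative: the quoted stride of 2048 elements suggests rows of 2048 floats, so I assume a 2048 x 2048 matrix and a 16 x 16 block.

// Host-side launch sketch -- sizes and block shape are illustrative
// assumptions, not the book's exact code.
#include <cuda_runtime.h>

int main(void) {
    const int nx = 2048, ny = 2048;
    const size_t nBytes = (size_t)nx * ny * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, nBytes);
    cudaMalloc((void **)&d_out, nBytes);
    // ... fill d_in with the row-major input matrix ...

    dim3 block(16, 16);   // assumed block shape
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    transposeNaiveCol<<<grid, block>>>(d_out, d_in, nx, ny);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}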

This is what I don't understand: the load throughput for transposeNaiveCol() is 642.33 GB/s, while for transposeNaiveRow() it is 129.05 GB/s. The author says:

The results show that the highest load throughput is obtained with cached, strided reads. In the case of cached reads, each memory request is serviced with a 128-byte cache line. Reading data by columns causes each memory request in a warp to replay 32 times (because the stride is 2048 data elements), resulting in good latency hiding from many in-flight global memory reads and then excellent L1 cache hit ratios once bytes are pre-fetched into L1 cache.

My question: I thought that aligned/coalesced reads were ideal, but here it seems that strided reads improve performance.

  1. Why does reading a full cache line per request hurt performance in this case?
  2. Aren't replays generally a bad thing? Here the book says they result in "good latency hiding".

Upvotes: 0

Views: 611

Answers (1)

user703016

Reputation: 37945

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than an equivalent, non-coalesced kernel, but that alone says nothing about its execution time. In the end, the one metric that really matters is the wall-clock time your kernel takes to completion, which the authors do not mention.

That being said, kernels usually fall into two categories:

  • Compute-bound kernels, whose performance can be improved by hiding instruction latency: keep the pipeline full (maximize ILP).
  • I/O-bound kernels, whose performance can be improved by hiding memory latency: keep data in flight (maximize bandwidth).

A matrix transpose has very low compute intensity, so it is I/O bound; to get better performance, you should therefore try to increase bandwidth usage.

Why is the column transpose better at maximizing bandwidth usage?

In the case of the row transpose, reads are coalesced: a single 128-byte transaction is served per warp, that is, 4 bytes per thread. Those 128 bytes are put into cache but never reused, so the cache is effectively of no use in this case.

In the case of the column transpose, reads are not coalesced: each warp is served 32 transactions of 128 bytes, all of which go into L1 and are reused for the next 31 replays (assuming they don't get evicted from cache). That is very low load efficiency but very high effective load throughput, and maximal cache usage.
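To make the two access patterns concrete, here is a small host-side sketch (not from the book; the 2048 x 2048 shape is assumed from the quoted stride) that prints the input byte offsets read by the 32 threads of the first warp in each kernel:

#include <stdio.h>

// Prints the input offsets touched by threads 0..31 of the first warp,
// assuming a row-major 2048 x 2048 float matrix and that consecutive
// threadIdx.x values give consecutive ix within the warp (iy = 0).
int main(void) {
    const int nx = 2048, ny = 2048;
    const int iy = 0;
    for (int t = 0; t < 32; ++t) {
        const int ix = t;
        long rowRead = (long)(iy * nx + ix) * (long)sizeof(float); // transposeNaiveRow
        long colRead = (long)(ix * ny + iy) * (long)sizeof(float); // transposeNaiveCol
        printf("thread %2d: row read %6ld  col read %8ld\n", t, rowRead, colRead);
    }
    // Row kernel: offsets 0, 4, 8, ..., 124   -> one 128-byte line, never reused.
    // Col kernel: offsets 0, 8192, 16384, ... -> 32 distinct 128-byte lines,
    //             each of which stays in L1 and serves the following replays.
    return 0;
}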

You could of course get the same effect in the row transpose by simply requesting more data per thread (for example by loading 32 floats, or 8 float4 values, per thread) or by using CUDA's prefetch capabilities; a rough sketch of the idea follows.
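For illustration only (this kernel is not from the book or the answer above), here is a sketch of the vectorized-load idea with a single float4 per thread; the same pattern extends to more elements per thread. It assumes nx is a multiple of 4 and that in is 16-byte aligned.

// Hypothetical sketch: row-oriented transpose loading a float4 per thread.
// Assumes nx is a multiple of 4 and that in is 16-byte aligned.
__global__ void transposeRowFloat4(float *out, const float *in,
                                   const int nx, const int ny) {
    unsigned int ix4 = blockDim.x * blockIdx.x + threadIdx.x; // index in float4 units
    unsigned int iy  = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix4 < nx / 4 && iy < ny) {
        // Coalesced, vectorized read of 4 consecutive elements of row iy.
        float4 v = reinterpret_cast<const float4 *>(in)[iy * (nx / 4) + ix4];

        unsigned int ix = ix4 * 4;
        // Strided writes into columns ix..ix+3 of the transposed output.
        out[(ix + 0) * ny + iy] = v.x;
        out[(ix + 1) * ny + iy] = v.y;
        out[(ix + 2) * ny + iy] = v.z;
        out[(ix + 3) * ny + iy] = v.w;
    }
}

With 16 bytes requested per thread, each warp issues 512 bytes of loads per load instruction (four 128-byte transactions), which puts more memory reads in flight per warp, similar in spirit to what the replays achieve in the column kernel.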

Upvotes: 5
