Reputation: 132108
A few years back, NVIDIA's Mark Harris posted this:
An Efficient Matrix Transpose in CUDA C/C++
in which he described how to transpose a matrix faster using shared memory than with the naive approach. As a methodological baseline, he also implemented a shared-memory-tile-based version of a simple matrix copy.
Somewhat surprisingly, copying through shared memory tiles performed faster than the "naive" copy (with a 2D grid): 136 GB/sec for the naive copy versus 152.3 GB/sec for the shared-mem-tile-based copy. That was on a Kepler micro-architecture card, the Tesla K20c.
My question: Why does this make sense? That is, why was the effective bandwidth not lower when all that's done is coalesced reading and writing? Specifically, did it have something to do with the fact the __restrict
wasn't used (and thus __ldg()
was probably not used)?
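For reference, the naive 2D-grid copy kernel in the post is along these lines (a sketch of the post's tiling scheme with TILE_DIM = 32 and BLOCK_ROWS = 8; take the exact code as illustrative rather than a verbatim quote):

// Naive copy: each thread copies TILE_DIM/BLOCK_ROWS elements straight from
// global memory to global memory; the kernel is launched with a 2D grid of
// (TILE_DIM x BLOCK_ROWS) thread blocks, each covering a TILE_DIM x TILE_DIM tile.
#define TILE_DIM   32
#define BLOCK_ROWS  8

__global__ void copy(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}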
Note: This question is not about transposition. The post was about transposition, and its lessons are well taken; but it did not discuss the odd phenomenon involving the simple, non-transposed copy.
Upvotes: 5
Views: 295
Reputation: 6391
It is unlikely that this was down to the GDDR5 reads/writes themselves, as those should have been buffered entirely by the L2 cache and masked by high occupancy. Nor was it the coalescing of the reads/writes (or the lack thereof), even though Kepler was easily slowed down by uncoalesced accesses.
All we are seeing here is a longer pipeline between the read and the write, which masks whatever latency is left on the read operation.
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
Without __restrict, the compiler has to assume a data dependency between loop iterations, so each iteration implicitly has to synchronize on the previous one. That is not even the effect of not using __ldg() (going through the texture unit makes no difference when no data re-use is likely), but a straight stall on the global memory read.
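To make that concrete, restrict-qualifying the pointers (or loading through __ldg() explicitly) removes the assumed dependency. A sketch, not code from the post:

// Same TILE_DIM/BLOCK_ROWS tiling as the naive copy. With __restrict__ the
// compiler may assume odata and idata don't alias, so the loads of all
// iterations can be issued before any of the stores; __ldg() (available on
// cc 3.5+ parts such as the K20c) additionally routes the load through the
// read-only/texture cache.
__global__ void copy_restrict(float * __restrict__ odata,
                              const float * __restrict__ idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = __ldg(&idata[(y+j)*width + x]);
}

With the aliasing assumption gone, the loads can overlap in much the same way they do in the shared-memory version below.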
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

__syncthreads();

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
This, on the other hand, does not have to stall, except on the last few rows right before the sync. Assume that the compiler has unrolled these simple loops, and it becomes obvious.
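Roughly (not actual compiler output, just the shape of the two unrolled instruction streams, assuming TILE_DIM/BLOCK_ROWS = 4 iterations per thread):

// Naive copy, unrolled:
//     LD r0, idata[row0];  ST odata[row0], r0;   // each ST waits on its LD, and
//     LD r1, idata[row1];  ST odata[row1], r1;   // without __restrict the next LD
//     LD r2, idata[row2];  ST odata[row2], r2;   // cannot be hoisted above the
//     LD r3, idata[row3];  ST odata[row3], r3;   // preceding ST
//
// Shared-memory copy, unrolled:
//     LD r0..r3, idata[row0..row3];              // all four loads in flight at once
//     STS tile[...], r0..r3;                     // stores to shared memory
//     __syncthreads();
//     LDS r0..r3, tile[...]; ST odata[...], r0..r3;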
The __syncthreads() in there is even counterproductive in this specific case: there is no good reason to wait for the last rows to complete their reads before beginning the write-out.
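In the copy case every thread reads back exactly the tile elements it wrote itself, so nothing is exchanged between threads and the barrier can simply be dropped. A sketch of that variant (not from the post; for the actual transpose the __syncthreads() is required, since there threads do read each other's tile elements):

__global__ void copySharedMemNoSync(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM * TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Stage this thread's own elements in shared memory...
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

    // ...and write them back out without waiting for the rest of the block.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}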
Upvotes: 1