Reputation: 132108
A few years back, NVIDIA's Mark Harris posted this:
An Efficient Matrix Transpose in CUDA C/C++
in which he described how to transpose a matrix faster using shared memory than with the naive approach. As a methodological baseline, he also implemented a shared-memory-tile-based version of a simple matrix copy.
Somewhat surprisingly, copying through shared memory tiles performed faster than the "naive" copy (with a 2D grid): 136 GB/sec for the naive copy versus 152.3 GB/sec for the shared-mem-tile-based copy. That was on a Kepler micro-architecture card, the Tesla K20c.
My question: Why does this make sense? That is, why was the effective bandwidth not lower when all that's done is coalesced reading and writing? Specifically, did it have something to do with the fact the __restrict
wasn't used (and thus __ldg()
was probably not used)?
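For reference, the naive 2D-grid copy kernel in the post is along these lines (a sketch of the post's tiling scheme with TILE_DIM = 32 and BLOCK_ROWS = 8; take the exact code as illustrative rather than a verbatim quote):

// Naive copy: each thread copies TILE_DIM/BLOCK_ROWS elements straight from
// global memory to global memory; the kernel is launched with a 2D grid of
// (TILE_DIM x BLOCK_ROWS) thread blocks, each covering a TILE_DIM x TILE_DIM tile.
#define TILE_DIM   32
#define BLOCK_ROWS  8

__global__ void copy(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}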
Note: This question is not about transposition. The post was about transposition, and its lessons are well taken; but it did not discuss the odd phenomenon involving the simple, non-transposed copy.
Upvotes: 5
Views: 295
Reputation: 6391
It is unlikely that this was down to the GDDR5 reads/writes themselves, as those should have been buffered entirely by the L2 cache and masked by high occupancy. Nor was it the coalescing of the reads/writes (or the lack thereof), even though Kepler was easily slowed down by uncoalesced accesses.
All we are seeing here is a longer pipeline between the read and the write, which masks whatever latency is left on the read operation.
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
Without __restrict, the compiler has to assume a data dependency between loop iterations, so each iteration implicitly has to synchronize on the previous one. That is not even the effect of not using __ldg() (going through the texture unit makes no difference when no data re-use is likely), but a straight stall on the global memory read.
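To make that concrete, restrict-qualifying the pointers (or loading through __ldg() explicitly) removes the assumed dependency. A sketch, not code from the post:

// Same TILE_DIM/BLOCK_ROWS tiling as the naive copy. With __restrict__ the
// compiler may assume odata and idata don't alias, so the loads of all
// iterations can be issued before any of the stores; __ldg() (available on
// cc 3.5+ parts such as the K20c) additionally routes the load through the
// read-only/texture cache.
__global__ void copy_restrict(float * __restrict__ odata,
                              const float * __restrict__ idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = __ldg(&idata[(y+j)*width + x]);
}

With the aliasing assumption gone, the loads can overlap in much the same way they do in the shared-memory version below.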
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

__syncthreads();

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
This, on the other hand, does not have to stall, except on the last few rows right before the sync. Assume that the compiler has unrolled these simple loops, and it becomes obvious.
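Roughly (not actual compiler output, just the shape of the two unrolled instruction streams, assuming TILE_DIM/BLOCK_ROWS = 4 iterations per thread):

// Naive copy, unrolled:
//     LD r0, idata[row0];  ST odata[row0], r0;   // each ST waits on its LD, and
//     LD r1, idata[row1];  ST odata[row1], r1;   // without __restrict the next LD
//     LD r2, idata[row2];  ST odata[row2], r2;   // cannot be hoisted above the
//     LD r3, idata[row3];  ST odata[row3], r3;   // preceding ST
//
// Shared-memory copy, unrolled:
//     LD r0..r3, idata[row0..row3];              // all four loads in flight at once
//     STS tile[...], r0..r3;                     // stores to shared memory
//     __syncthreads();
//     LDS r0..r3, tile[...]; ST odata[...], r0..r3;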
The __syncthreads() in there is even counterproductive in this specific case: there is no good reason to wait for the last rows to complete their reads before beginning the write-out.
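In the copy case every thread reads back exactly the tile elements it wrote itself, so nothing is exchanged between threads and the barrier can simply be dropped. A sketch of that variant (not from the post; for the actual transpose the __syncthreads() is required, since there threads do read each other's tile elements):

__global__ void copySharedMemNoSync(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM * TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Stage this thread's own elements in shared memory...
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

    // ...and write them back out without waiting for the rest of the block.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}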
Upvotes: 1