Reputation: 8889
Surface memory is the writable analogue of texture memory in CUDA: textures are read-only, while surfaces support both reads and writes.
I've found NVIDIA GPU peak-bandwidth numbers in the academic literature for reads from global memory and shared memory. However, I've found much less information on write throughput for the various CUDA memory spaces.
In particular, I'm interested in the bandwidth (and latency too, if known) of the CUDA surface memory on Fermi and Kepler GPUs.
Upvotes: 1
Views: 934
Reputation: 11549
On compute capability 2.x and 3.x devices, surface writes go through the L1 cache and have the same throughput and latency as global writes.
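If you want to verify this yourself, one way is to launch two kernels that write the same amount of data, once through a surface and once through a plain global pointer, and compare elapsed times. The sketch below uses the Fermi/Kepler-era surface-reference API mentioned in the question; the kernel and variable names are illustrative, and the surface is assumed to be bound to a 2D CUDA array of `float` before launch.

```cuda
// Sketch: compare surface-write vs. global-write throughput (untested).
// surfRef must be bound to a 2D cudaArray of float via cudaBindSurfaceToArray.
surface<void, cudaSurfaceType2D> surfRef;

__global__ void surfWriteKernel(int width, int height, float value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        // Note: the x coordinate of surf2Dwrite is in bytes, not elements.
        surf2Dwrite(value, surfRef, x * sizeof(float), y);
}

__global__ void globalWriteKernel(float *out, int width, int height, float value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = value;
}
```

Timing each kernel with `cudaEvent` records over many iterations and dividing bytes written by elapsed time gives an achieved-bandwidth figure for each path; if the answer above is right, the two numbers should be close.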
Upvotes: 2
Reputation: 9779
According to the Device Memory Accesses section of the CUDA C Programming Guide, the latencies of texture, surface, and global memory are almost the same, and all of them reside in off-chip DRAM. I therefore think the peak bandwidth of surface memory matches the global-memory bandwidth listed in the GPU specs.
To time the latency, the paper you referenced likely uses only one thread, which makes the latency easy to compute:

global mem read latency = total read time / number of reads

You could time surface writes in a similar fashion. But I don't think it is reasonable to apply this method to shared-memory latency measurement as done in that paper, since the overhead of the for loop may not be negligible compared to the shared-memory latency.
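The single-thread timing scheme above can be sketched as a kernel that issues a chain of dependent reads and measures them with the device-side `clock()` counter. This is a hedged sketch, not the referenced paper's actual code: the `chain` array (assumed to be pre-initialized so that `chain[i]` points to the next index to read) makes each load depend on the previous one, so the loads cannot overlap.

```cuda
// Sketch: single-thread global-memory read latency via clock() (untested).
// chain must be pre-filled so that following chain[idx] visits n elements.
__global__ void globalReadLatency(const int *chain, int n,
                                  clock_t *elapsed, int *sink)
{
    int idx = 0;
    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        idx = chain[idx];      // dependent loads serialize the accesses
    clock_t stop = clock();
    *elapsed = stop - start;   // latency ~ (stop - start) / n, minus loop overhead
    *sink = idx;               // prevent the compiler from removing the loads
}
```

Launched with a single thread, the per-read latency in cycles is roughly `*elapsed / n`; for global memory the loop overhead is small relative to the DRAM latency, which is exactly why, as noted above, the same trick is questionable for shared memory.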
Upvotes: 2