Reputation: 69
Suppose you have a memory-bound GPU kernel: how close can you get to the stated theoretical bandwidth of the GPU? Even in Mark Harris's Optimising Parallel Reduction presentation he 'only' gets 63 GB/s, which is about 73% of the bandwidth of his test GPU (a G80), for which he quotes a peak of 86.4 GB/s. Could Harris have optimised his kernel further? Are there other techniques that were perhaps too advanced or out of scope for the presentation, e.g. __shfl-type instructions? Why didn't he achieve a higher bandwidth?
This article claims, using a test machine with a Tesla C2050:
"throughput is memory-bandwidth limited, sustaining around 75% of the 144 GB/s peak memory bandwidth, compared to a practical limit of 85% of peak when accounting for overheads such as DRAM refresh."
Is this correct? The authors don't provide a source for the "85% practical bandwidth limit", and I haven't been able to find anything else mentioning it. If it is correct, what other factors (assuming you have a very well-optimised kernel) would prevent you from reaching the theoretical peak bandwidth?
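For context, percentages like "73% of peak" come from the usual effective-bandwidth calculation: bytes read plus bytes written, divided by elapsed kernel time, compared against the quoted peak. A rough sketch of that arithmetic, where the element count and timing are made-up placeholders and only the 86.4 GB/s peak comes from above:

#include <stdio.h>

int main(void) {
    const double n         = 1 << 22;          // elements reduced (placeholder)
    const double bytes     = n * sizeof(int);   // a reduction reads each input roughly once
    const double elapsed_s = 0.27e-3;           // measured kernel time in seconds (placeholder)
    const double peak_gbps = 86.4;              // quoted G80 peak bandwidth
    const double gbps      = bytes / elapsed_s / 1e9;
    printf("%.1f GB/s, %.0f%% of peak\n", gbps, 100.0 * gbps / peak_gbps);
    return 0;
}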
Upvotes: 3
Views: 437
Reputation: 1904
A similar thread: GPU Memory bandwidth theoretical vs practical
Running a minimal kernel that only writes data to a large 1D vector:
__global__ void kernel( int *out ) {
    // one element per thread; the grid is assumed to exactly cover the vector
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    out[idx] = idx % 4;
}
on a GeForce GT 710 I got about 0.9 of the theoretical bandwidth:
practical (measured): 12.9 GB/s
theoretical (spec): 14.4 GB/s
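For reference, here is a sketch of the kind of host code that could produce such a measurement. The vector size, block size, and event-based timing are my own illustrative choices, not necessarily the exact setup behind the 12.9 GB/s figure:

#include <stdio.h>

// the write-only kernel from above
__global__ void kernel( int *out ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    out[idx] = idx % 4;
}

int main() {
    const int n = 1 << 26;                     // 64M ints = 256 MiB written (illustrative)
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    const int block = 256;
    const int grid  = n / block;               // n is a multiple of block here

    kernel<<<grid, block>>>(d_out);            // warm-up launch
    cudaDeviceSynchronize();

    // time one launch with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<grid, block>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (n * sizeof(int)) / (ms * 1e-3) / 1e9;   // effective write bandwidth
    printf("wrote %zu MiB in %.3f ms -> %.1f GB/s\n",
           (size_t)n * sizeof(int) >> 20, ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}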
One thing that might contribute to the slowdown is caching.
Upvotes: 0