Reputation: 35471
I want a measure of how much of the peak memory bandwidth my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a peak bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:
...
float result = 0.0f;
for (int k = 0; k < 4000; k++) {
    result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
    ....
}
out_data[index] = result;
out_data2[index] = sqrt(result);
...
I count 4000*2+2 accesses to global memory per thread. With 1,000,000 threads, and all accesses being 4-byte floats, that amounts to ~32 GB of global memory traffic (reads and writes added up). As my kernel only takes 0.1 s, I would be achieving ~320 GB/s, which is higher than the peak bandwidth, so there must be an error in my calculations / assumptions. I assume CUDA does some caching, so not all memory accesses count. My questions follow the profiler output below.
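For reference, here is my back-of-the-envelope arithmetic as a minimal sketch (the thread count and the kernel time are the rough figures from above, not measured values):

#include <stdio.h>

/* Minimal sketch of the arithmetic above. The thread count (~1,000,000)
   and the kernel time (0.1 s) are rough figures, not measurements. */
int main(void) {
    const double threads  = 1.0e6;               /* ~1,000,000 threads      */
    const double accesses = 4000.0 * 2.0 + 2.0;  /* per-thread accesses     */
    const double bytes    = threads * accesses * sizeof(float);
    const double kernel_s = 0.1;                 /* approximate kernel time */

    printf("traffic:   %.1f GB\n",   bytes / 1e9);            /* ~32 GB    */
    printf("bandwidth: %.0f GB/s\n", bytes / kernel_s / 1e9); /* ~320 GB/s */
    return 0;
}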
Profiler output:
method         gputime   cputime  occupancy  instruction  warp_serial  memtransfer
memcpyHtoD      10.944        17                                             16384
fill            64.32         93          1        14556            0
fill            64.224        83          1        14556            0
memcpyHtoD      10.656        11                                             16384
fill            64.064        82          1        14556            0
memcpyHtoD    1172.96       1309                                           4194304
memcpyHtoD      10.688        12                                             16384
cu_more_regT 93223.906     93241          1     40716656            0
memcpyDtoH    1276.672      1974                                           4194304
memcpyDtoH    1291.072      2019                                           4194304
memcpyDtoH    1278.72       2003                                           4194304
memcpyDtoH    1840          3172                                           4194304
New question:
- With 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB per array and gputime ~= 0.1 s, I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error? (The conversion I am using is sketched below.)
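For clarity, this is the conversion I am doing, as a minimal sketch (I am assuming the profiler's gputime column is in microseconds and memtransfer is in bytes):

#include <stdio.h>

/* Sketch of the conversion above. Assumes gputime is reported in
   microseconds and memtransfer in bytes, as in the profiler rows. */
int main(void) {
    const double bytes     = 4194304.0;  /* memtransfer: 4 MB per array */
    const double kernel_us = 93223.906;  /* gputime of cu_more_regT     */

    /* one 4 MB array over the whole kernel time: ~45 MB/s */
    printf("per array: %.0f MB/s\n", bytes / (kernel_us * 1e-6) / 1e6);
    return 0;
}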
p.s. Tell me if you need other counters for your answer.
sister question: How to calculate Gflops of a kernel
Upvotes: 2
Views: 5484
Reputation: 3127
The default counters in Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed, ...).
Regarding your question, to calculate the achieved global memory throughput:
Compute Visual Profiler User Guide, DU-05162-001_v02, October 2010, page 56, Table 7, "Supported Derived Statistics":
Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as

(((gld_32 * 32) + (gld_64 * 64) + (gld_128 * 128)) * TPC) / gputime

For compute capability >= 2.0 this is calculated as

((DRAM reads) * 32) / gputime
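As a rough illustration (not code from the guide), the compute capability < 2.0 formula might look like this. The gld_* counter values are placeholders; I am assuming TPC is the number of texture processor clusters on the chip (10 on a Tesla C1060) and that gputime is in microseconds:

#include <stdio.h>

/* Sketch of the derived statistic above for compute capability < 2.0.
   Assumptions: gld_32/gld_64/gld_128 are the profiler's global-load
   transaction counters (32-, 64- and 128-byte transactions), TPC is
   the number of texture processor clusters (10 on a Tesla C1060), and
   gputime is in microseconds; the counter values are placeholders. */
int main(void) {
    const double gld_32     = 0.0;        /* placeholder counters     */
    const double gld_64     = 0.0;
    const double gld_128    = 1.0e6;
    const double tpc        = 10.0;       /* Tesla C1060              */
    const double gputime_us = 93223.906;  /* from the profiler output */

    const double bytes = ((gld_32 * 32.0) + (gld_64 * 64.0) + (gld_128 * 128.0)) * tpc;
    const double gbps  = bytes / (gputime_us * 1e-6) / 1e9;  /* GB/s */
    printf("global memory read throughput: %.2f GB/s\n", gbps);
    return 0;
}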
Hope this helps.
Upvotes: 4