Reputation: 325
Most of the papers show the flops/Gflops and achieved bandwidth for their CUDA kernels. I have also read answers on stackoverflow for the following questions:
How to evaluate CUDA performance?
How Do You Profile & Optimize CUDA Kernels?
How to calculate Gflops of a kernel
Counting FLOPS/GFLOPS in program - CUDA
How to calculate the achieved bandwidth of a CUDA kernel
Most of it seems clear, but I still don't feel comfortable calculating these values myself. Could someone write a simple CUDA kernel, give the output of deviceQuery, and then compute the flops/Gflops and achieved bandwidth for that kernel step by step? Then show the Visual Profiler results for the same kernel, i.e. the detailed results with all the information obtained at each step. That would be really helpful for most of us. Thanks!
Upvotes: 1
Views: 5827
Reputation: 21
You could follow Mark Harris's calculations in Optimizing Parallel Reduction in CUDA. There he takes the size of the input data as the basis and divides it by the kernel execution time. In the examples he used 2^22 ints (4 bytes each), so he has 0.016777216 GB of input data. The first kernel took 8.054 ms, which gives an achieved bandwidth of 2.083 GB/s.
After several optimizations he reached 62.671 GB/s, which he compares to the peak bandwidth of the GPU he used, 86.4 GB/s.
Although he used ints, you can easily adapt this to flops/Gflops: count the floating-point operations the kernel performs instead of the bytes it reads, and divide by the same execution time.
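The arithmetic above can be reproduced in a few lines. This is only a sketch in Python (the language is incidental, it is plain division); the helper names are made up for illustration, and the figures are the ones quoted from the slides. The operation rate at the end assumes that summing N elements takes N - 1 additions, which is my assumption about how to count the work, not something taken from the slides.

```python
# Sketch: effective bandwidth and operation rate from data size and kernel time.
# Function names are illustrative, not part of any CUDA API.

def effective_bandwidth_gbs(bytes_moved, seconds):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9

def throughput_gops(op_count, seconds):
    """Operations performed divided by elapsed time, in Gop/s."""
    return op_count / seconds / 1e9

N = 2 ** 22                 # number of ints reduced
BYTES_PER_INT = 4
kernel_seconds = 8.054e-3   # time of the first kernel, 8.054 ms

input_bytes = N * BYTES_PER_INT            # 16777216 bytes = 0.016777216 GB
bandwidth = effective_bandwidth_gbs(input_bytes, kernel_seconds)
print(f"achieved bandwidth: {bandwidth:.3f} GB/s")

# Fraction of peak reached by the final optimized kernel.
peak_gbs = 86.4
best_gbs = 62.671
print(f"fraction of peak: {best_gbs / peak_gbs:.1%}")

# Assumption: a sum over N elements needs N - 1 additions.
adds = N - 1
op_rate = throughput_gops(adds, kernel_seconds)
print(f"operation rate: {op_rate:.4f} Gop/s")
```

For a floating-point kernel the last step is the same: replace `adds` with the number of floating-point operations and the result is in Gflops.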
Upvotes: 0
Reputation: 11529
Nsight Visual Studio Edition 2.1 and Above
The information you requested is available if you collect the Achieved FLOPS experiment and the Memory Statistics - Buffers experiment.
Visual Profiler 4.2 and Above
Achieved Bandwidth: When you hover over a kernel in the Timeline, this information is available in the Properties Pane under Memory\DRAM Utilization.
The profiler cannot collect a FLOPS count yet. You can get one by running cuobjdump -sass to view the assembly code. Step through the kernel and count the single and double precision floating-point instructions, multiplying FMA and DFMA operations by 2. Each instruction should also be multiplied by the number of threads for which its predicate is true. You also have to account for control flow. This is not fun and requires someone with a strong knowledge of the instruction set. It may be easier to accomplish by single-stepping the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration; dividing the operation count by this duration gives the achieved FLOPS.
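Once the per-instruction counts have been gathered by hand, the bookkeeping described above is a weighted sum. The following is a minimal sketch in Python, not a tool that parses cuobjdump output. The opcode table is an assumption (I use FFMA for the single-precision fused multiply-add, and the table is not a complete list of floating-point instructions), and the counts and the duration are invented purely to show the arithmetic.

```python
# Sketch: turn hand-counted SASS instruction executions into a GFLOPS figure.
# The opcode table is illustrative and incomplete; extend it for your kernel.
FLOPS_PER_OP = {
    "FADD": 1, "FMUL": 1, "FFMA": 2,   # single precision, fused multiply-add counts twice
    "DADD": 1, "DMUL": 1, "DFMA": 2,   # double precision
}

def total_flops(thread_instruction_counts):
    """Sum the FLOPs over all counted instructions.

    thread_instruction_counts maps an opcode to the number of times it was
    executed, summed over every thread whose predicate was true.
    Opcodes that are not floating-point operations contribute nothing.
    """
    total = 0
    for opcode, executions in thread_instruction_counts.items():
        total += FLOPS_PER_OP.get(opcode, 0) * executions
    return total

def gflops(flop_count, duration_seconds):
    """FLOP count divided by kernel duration, in GFLOPS."""
    return flop_count / duration_seconds / 1e9

# Hypothetical counts: each of 2^20 threads runs one FFMA and one integer add,
# and half of the threads also run one predicated FADD.
counts = {"FFMA": 1 << 20, "FADD": 1 << 19, "IADD": 1 << 20}
duration_seconds = 0.1e-3   # hypothetical Duration read from the profiler

flops = total_flops(counts)
print(f"{flops} FLOPs, {gflops(flops, duration_seconds):.4f} GFLOPS")
```

Control flow is already folded into the counts here: an instruction inside a loop or a branch is counted once per thread per time it actually executes.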
Upvotes: 1