Calculating achieved bandwidth and FLOPS/GFLOPS, and evaluating CUDA kernel performance

Question

Most papers report the FLOPS/GFLOPS and achieved bandwidth of their CUDA kernels. I have also read the answers on Stack Overflow to the following questions:

How to evaluate CUDA performance?

How Do You Profile & Optimize CUDA Kernels?

How to calculate Gflops of a kernel

Counting FLOPS/GFLOPS in program - CUDA

How to calculate the achieved bandwidth of a CUDA kernel

Most of it seems fine, but I still do not feel comfortable calculating these numbers myself. Can anyone write a simple CUDA kernel, give the output of deviceQuery, compute the FLOPS/GFLOPS and achieved bandwidth for that kernel step by step, and then show the Visual Profiler results for it? In other words, show the results in detail, with all the information obtained at each step, for this simple CUDA kernel. That would be really helpful for most of us. Thanks!

Greg Smith · Accepted Answer

Nsight Visual Studio Edition 2.1 and Above

The information you requested is available if you collect the Achieved FLOPS experiment and the Memory Statistics - Buffers experiment.

Visual Profiler 4.2 and Above

Achieved Bandwidth: When you mouse over a kernel in the Timeline, this information is available in the Properties Pane under Memory\DRAM Utilization.

The profiler cannot yet collect a FLOPS count. You can obtain one by running cuobjdump -sass to view the assembly code. Step through the kernel and count the single and double precision floating point instructions, multiplying FMA and DFMA operations by 2. Each instruction count should also be multiplied by the number of threads for which the instruction is predicated true. You also have to account for control flow. This is not fun and requires someone with a strong knowledge of the instruction set. It may be easier to do by single stepping the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration.
