Reputation: 4923
I've been running CUDA kernels, and I observe a considerable difference between the kernel execution time reported by GPU counters and the time reported by NVVP. Why is such a difference usually observed?
Upvotes: 0
Views: 1265
Reputation: 11549
Nsight Visual Studio Edition and the Visual Profiler support two mechanisms for capturing the duration of a kernel. Both of these methods produce a value that is smaller and more accurate than what is reported by CUevent/cudaEvent. The methods are as follows:
Concurrent kernel trace: This is the default mode used by Nsight 2.x and Visual Profiler 5.0 to generate a timeline. The duration of a kernel is defined as the time from when the kernel code starts executing on the device to the time it completes. This cannot be measured using CUDA events.
Serialized kernel trace: This is the default mode used by the tools when collecting PM counters for each kernel. The duration of a kernel is defined as the time from when the GPU processes the launch request until the GPU idles after completion of the kernel. This mode specifically disables concurrent kernel execution. In almost all cases the reported duration will be slightly larger than the concurrent kernel trace duration, as it includes time for the GPU to launch the first block and time for the GPU to complete all memory stores.
CUDA event timing is done by calling cu/cudaEventRecord before and after the kernel launch on the same stream. Each event record inserts a command into the GPU push buffer. When the command reaches the GPU, it writes a timestamp to memory. It is also possible to push two event records with no launch between them, which allows a developer to measure the GPU time between the two timestamp commands. This method has the following disadvantages, which is why I encourage developers to use the tools (Nsight, Visual Profiler, and CUPTI):
a. The GPU can context switch between the start event record and the kernel execution.

b. The start event record will include launch overhead, including the time to update driver buffers that need to be resized, copy parameters, copy texture bindings, ...

c. The elapsed time between submitting the kernel and the end event record can impact the timing.

d. The GPU can context switch between the end of the kernel execution and the end event record.

e. Incorrect use of events will break concurrent kernel execution.
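The event-based timing described above can be sketched as follows. This is a minimal example, not the exact code from any tool; the trivial `myKernel` and its launch configuration are placeholders for illustration, and running it requires a CUDA-capable GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical) so the timing code has something to measure.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on the same stream immediately before and after the launch.
    // Note that the measured interval includes launch overhead and any context
    // switches between the timestamp commands, not just kernel execution.
    cudaEventRecord(start, 0);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);  // wait until the stop timestamp has been written
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel + overhead: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```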
Each of these modes will report a different value for the duration. Furthermore, the definition of duration used by the tools differs from the one available through events.
The NVIDIA tools define duration, as closely as possible, as the time from when the GPU starts working on the kernel to when the GPU completes that work. Developers interested in collecting this information should look at the CUPTI SDK included with the toolkit.
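As a rough illustration of collecting per-kernel start/end timestamps with CUPTI, the asynchronous activity-buffer API can be used as sketched below. This is an assumption-laden sketch, not a complete profiler: the buffer size is arbitrary, error checking is omitted, and the exact kernel record struct name (`CUpti_ActivityKernel`, `CUpti_ActivityKernel2`, ...) varies by toolkit version:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)  // arbitrary buffer size for this sketch

// CUPTI calls this when it needs a buffer to store activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxNumRecords = 0;  // let CUPTI fit as many records as possible
}

// CUPTI calls this when a buffer is full or flushed; walk the records.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL) {
            // Struct version depends on the CUPTI release in your toolkit.
            CUpti_ActivityKernel *k = (CUpti_ActivityKernel *)record;
            printf("%s: %llu ns\n", k->name,
                   (unsigned long long)(k->end - k->start));
        }
    }
    free(buffer);
}

int main() {
    // Enable kernel activity records and register the buffer callbacks
    // before launching any kernels.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);

    // ... launch kernels here ...

    cuptiActivityFlushAll(0);  // deliver any buffered records before exit
    return 0;
}
```

The `start` and `end` fields of the kernel record are the GPU-side timestamps the tools use, which is why they exclude the launch and context-switch overheads that inflate event-based measurements.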
Upvotes: 4