Tyson Hilmer

Reputation: 771

How to get the average execution time of a CUDA kernel using Nsight Systems or Nsight Compute

Suppose I have a simple CLI test app named "Foo". This app executes a kernel "Bar" 100 times in a loop. How can I obtain the average kernel execution time for Bar using Nsight Systems or Nsight Compute, with either the GUI or the CLI version of these tools?

The NVIDIA Visual Profiler shows this information for each kernel in the Properties dialog, as "Duration (kernel)" and "Invocations".

I would like to obtain the same information with Nsight Systems or Nsight Compute, because Visual Profiler is to be deprecated.

Following the example in this post, I ran:

nv-nsight-cu-cli -k Bar Foo

I get 100 printouts, one for each kernel execution. I want only the summary information for kernel Bar.

Upvotes: 1

Views: 1223

Answers (2)

Zois Tasoulas

Reputation: 1553

With nsys, you can run:

nsys stats -r cuda_kern_exec_sum <nsys-rep report>

Also check the :base and :mangled options for the report.

For more information on the report output, run:

nsys stats --help-reports=cuda_kern_exec_sum
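The stats command above needs a report file to read. A minimal sketch of the full sequence for the question's app follows. It assumes the binary is ./Foo in the current directory, uses the hypothetical report name foo_report, and assumes an nsys release that writes .nsys-rep files (older releases used the .qdrep extension).

```shell
# Assumptions: ./Foo is the test app from the question; "foo_report" is a
# hypothetical output name; the installed nsys writes .nsys-rep files.
APP=./Foo
REPORT=foo_report

if command -v nsys >/dev/null 2>&1; then
    # Step 1: record a trace of the whole run (all 100 launches of Bar).
    nsys profile -o "$REPORT" "$APP"
    # Step 2: summarize kernel execution times from the recorded report.
    nsys stats -r cuda_kern_exec_sum "$REPORT.nsys-rep"
else
    echo "nsys not found on PATH" >&2
fi
```

The summary lists one row per kernel, so the row for Bar should aggregate all of its launches rather than printing each one.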

Upvotes: 0

Anis Ladram

Reputation: 1605

You can achieve this with the Nsight Compute CLI using the option --print-summary per-gpu, which reports the minimum, maximum, and average of each metric, including the kernel duration. Example below:

$ ncu -k matrixMul --print-summary per-gpu ./test | grep -C8 Duration
      ----------------------- ------------- ---------- ---------- ----------
      Metric Name               Metric Unit    Minimum    Maximum    Average
      ----------------------- ------------- ---------- ---------- ----------
      DRAM Frequency          cycle/nsecond       6.72       6.90       6.79
      SM Frequency            cycle/nsecond       1.48       1.51       1.49
      Elapsed Cycles                  cycle 166,647.00 168,469.00 167,522.43
      Memory Throughput                   %      73.43      74.10      73.76
      DRAM Throughput                     %       2.50       2.57       2.53
      Duration                      usecond     111.20     112.90     112.18
      L1/TEX Cache Throughput             %      84.50      85.35      84.99
      L2 Cache Throughput                 %      10.40      10.64      10.54
      SM Active Cycles                cycle 144,432.91 145,882.70 145,043.22
      Compute (SM) Throughput             %      73.43      74.10      73.76
      ----------------------- ------------- ---------- ---------- ----------

      Section: Launch Statistics
      -------------------------------- --------------- ---------- ---------- ----------
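If only the average duration is needed, the table can be filtered further. The sketch below assumes the summary keeps the column layout shown above (metric name, unit, minimum, maximum, average) and uses a hypothetical helper named avg_duration. The sample text stands in for real ncu output so that the filter can be tried without a GPU.

```shell
# Sample rows copied from the per-GPU summary shown above.
summary='      Elapsed Cycles                  cycle 166,647.00 168,469.00 167,522.43
      Duration                      usecond     111.20     112.90     112.18
      L1/TEX Cache Throughput             %      84.50      85.35      84.99'

# Hypothetical helper: print the average (last column) and the unit
# (second column) of the row whose first field is exactly "Duration".
avg_duration() {
    awk '$1 == "Duration" { print $NF, $2 }'
}

printf '%s\n' "$summary" | avg_duration
```

With real output, the same filter could be applied as ncu -k Bar --print-summary per-gpu ./Foo | avg_duration. Any line printed by the application itself that starts with "Duration" would also match, so this is a convenience filter rather than a robust parser.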

Upvotes: 3
