BoringSession
BoringSession

Reputation: 83

NSight Compute not showing achieved occupancy in the metrics

I want to calculate the achieved occupancy and compare it with the value that is being displayed in Nsight Compute. ncu says: Theoretical Occupancy [%] 100, and Achieved Occupancy [%] 93,04. What parameters do i need to calculate this value?

I can see the theoretical occupancy using the occupancy api, which comes out as 1.0 or 100%.

I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__actice_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum. I can see the formaula to calculate the achieved occupancy from this SO answer.

Few details if that might help:

There are 1 CUDA devices.

CUDA Device #0
Major revision number:         7
Minor revision number:         5
Name:                          NVIDIA GeForce GTX 1650
Total global memory:           4093181952
Total constant memory:         65536
Total shared memory per block: 49152
Total registers per block:     65536
Total registers per multiprocessor: 65536
Warp size:                     32
Maximum threads per block:     1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor:     16
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   2147483647
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1515000
Maximum memory pitch:          2147483647
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     14
Kernel execution timeout:      Yes

ptxas info    : Used 18 registers, 360 bytes cmem[0]

Upvotes: 0

Views: 1557

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 151964

Shorter:

In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.

Longer:

The metrics you have indicated:

I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum.

are not applicable to nsight compute. NVIDIA has made a variety of different profilers, and these metric names apply to other profilers. The article you reference refers to a different profiler (the original profiler on windows used the nsight name but was not nsight compute.)

This blog article discusses different ways to get valid nsight compute metric names with references to documentation links that present the metrics in different ways.

I would also point out for others that nsight compute has a whole report section dedicated to occupancy, and so for typical interest, that is probably the easiest way to go. Additional instructions for how to run nsight compute are available in this blog.

To come up with metrics that represent occupancy the way the nsight compute designers intended, my suggestion would be to look at their definitions. Each report section in nsight compute has "human-readable" files that indicate how the section is assembled. Since there is a report section for occupancy that includes reporting both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.

The methodology for how the occupancy section is computed is contained in 2 files which are part of a CUDA install. On a standard linux CUDA install, these will be in /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The python file gives the exact names of the metrics that are used as well as the calculation method(s) for other displayed topics related to occupancy (e.g. notes, warnings, etc.) In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.

You could retrieve both the Occupancy section report (which is part of the "default" "set") as well as these specific metrics with a command line like this:

ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active  ./my-app

Here is an example output from such a run:

$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active  ./t2140
Testing with mask size = 3

==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.

________________________________________________________________________

Testing with mask size = 5

==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.

________________________________________________________________________

Testing with mask size = 7

==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.

________________________________________________________________________

Testing with mask size = 9

==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.

________________________________________________________________________

==PROF== Disconnected from process 31551
[31551] [email protected]
  convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__maximum_warps_per_active_cycle_pct                                               %                             50
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          40.42
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         815.21
    SM Frequency                                                             cycle/nsecond                           1.14
    Elapsed Cycles                                                                   cycle                         47,929
    Memory [%]                                                                           %                          23.96
    DRAM Throughput                                                                      %                          15.23
    Duration                                                                       usecond                          42.08
    L1/TEX Cache Throughput                                                              %                          26.90
    L2 Cache Throughput                                                                  %                          10.54
    SM Active Cycles                                                                 cycle                      42,619.88
    Compute (SM) [%]                                                                     %                          37.09
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
          of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
          latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1,024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1,024
    Registers Per Thread                                                   register/thread                             38
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      1,048,576
    Waves Per SM                                                                                                    12.80
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              1
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                             50
    Achieved Occupancy                                                                   %                          40.42
    Achieved Active Warps Per SM                                                      warp                          25.87
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (50.0%) is limited by the number of required registers

  convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    sm__maximum_warps_per_active_cycle_pct                                               %                            100
    sm__warps_active.avg.pct_of_peak_sustained_active                                    %                          84.01
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         771.98
    SM Frequency                                                             cycle/nsecond                           1.07
    Elapsed Cycles                                                                   cycle                         31,103
    Memory [%]                                                                           %                          40.61
    DRAM Throughput                                                                      %                          24.83
    Duration                                                                       usecond                          29.12
    L1/TEX Cache Throughput                                                              %                          46.39
    L2 Cache Throughput                                                                  %                          18.43
    SM Active Cycles                                                                 cycle                      27,168.03
    Compute (SM) [%]                                                                     %                          60.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
          what the compute pipelines are spending their time doing. Also, consider whether any computation is
          redundant and could be reduced or moved to look-up tables.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1,024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1,156
    Registers Per Thread                                                   register/thread                             31
    Shared Memory Configuration Size                                                 Kbyte                           8.19
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                           4.10
    Threads                                                                         thread                      1,183,744
    Waves Per SM                                                                                                     7.22
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              2
    Block Limit Shared Mem                                                           block                             24
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          84.01
    Achieved Active Warps Per SM                                                      warp                          53.77
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
          theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
          or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
          as well as across blocks of the same kernel.

<sections repeat for each kernel launch>
$

Upvotes: 1

Related Questions