Reputation: 83
I want to calculate the achieved occupancy and compare it with the value displayed in Nsight Compute.
ncu reports Theoretical Occupancy [%] 100 and Achieved Occupancy [%] 93,04. What parameters do I need to calculate this value?
I can get the theoretical occupancy from the occupancy API, which comes out as 1.0, or 100%.
I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, and sm__active_cycles_sum, but all of them say: Failed to find metric sm__active_warps_sum. I found the formula for calculating the achieved occupancy in this SO answer.
A few details, in case they help:
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 7
Minor revision number: 5
Name: NVIDIA GeForce GTX 1650
Total global memory: 4093181952
Total constant memory: 65536
Total shared memory per block: 49152
Total registers per block: 65536
Total registers per multiprocessor: 65536
Warp size: 32
Maximum threads per block: 1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor: 16
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1515000
Maximum memory pitch: 2147483647
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: Yes
ptxas info : Used 18 registers, 360 bytes cmem[0]
Upvotes: 0
Views: 1557
Reputation: 151964
Shorter:
In a nutshell, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.
Longer:
The metrics you have indicated:
I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum.
are not applicable to Nsight Compute. NVIDIA has produced several different profilers, and these metric names belong to other profilers. The answer you reference is about a different profiler (the original profiler on Windows used the Nsight name but was not Nsight Compute).
This blog article discusses different ways to get valid Nsight Compute metric names, with links to documentation that presents the metrics in different ways.
I would also point out for others that Nsight Compute has a whole report section dedicated to occupancy, so for most purposes that is probably the easiest way to go. Additional instructions for how to run Nsight Compute are available in this blog.
To come up with metrics that represent occupancy the way the Nsight Compute designers intended, my suggestion would be to look at their definitions. Each report section in Nsight Compute has "human-readable" files that indicate how the section is assembled. Since there is a report section for occupancy that reports both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.
The methodology for how the occupancy section is computed is contained in two files that are part of a CUDA install. On a standard Linux CUDA install, these will be /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The Python file gives the exact names of the metrics that are used, as well as the calculation method(s) for other displayed topics related to occupancy (e.g. notes, warnings, etc.). In a nutshell, the theoretical occupancy is given by the metric sm__maximum_warps_per_active_cycle_pct, and the achieved occupancy is given by the metric sm__warps_active.avg.pct_of_peak_sustained_active.
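To make the percentages concrete, here is a minimal Python sketch of the arithmetic. It assumes that achieved occupancy is the average number of active warps per SM (per active cycle) divided by the maximum number of warps an SM can hold, which is consistent with the "Achieved Active Warps Per SM" and "Achieved Occupancy" rows of the Occupancy section. The function names are made up for illustration and are not part of any NVIDIA API; the device numbers come from the question (1024 threads per multiprocessor, warp size 32).

```python
def max_warps_per_sm(max_threads_per_sm: int, warp_size: int = 32) -> int:
    """Maximum number of resident warps on one SM."""
    return max_threads_per_sm // warp_size


def achieved_occupancy_pct(avg_active_warps_per_sm: float, max_warps: int) -> float:
    """Achieved occupancy as a percentage of the SM's warp capacity."""
    return 100.0 * avg_active_warps_per_sm / max_warps


def active_warps_from_pct(occupancy_pct: float, max_warps: int) -> float:
    """Invert the percentage back into average active warps per SM."""
    return occupancy_pct / 100.0 * max_warps


# GTX 1650 from the question: 1024 threads per SM, 32 threads per warp.
gtx1650_max_warps = max_warps_per_sm(1024)

# The 93.04% reported by ncu, expressed as average active warps per SM.
gtx1650_active_warps = active_warps_from_pct(93.04, gtx1650_max_warps)

print(gtx1650_max_warps, round(gtx1650_active_warps, 2))
```

Under these assumptions, the 93.04% in the question corresponds to a little under 30 of the 32 possible warps being active on average. The average itself still has to come from the profiler; it cannot be derived from the device properties alone.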
You could retrieve both the Occupancy section report (which is part of the "default" set) and these specific metrics with a command line like this:
ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./my-app
Here is an example output from such a run:
$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./t2140
Testing with mask size = 3
==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.
________________________________________________________________________
Testing with mask size = 5
==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.
________________________________________________________________________
Testing with mask size = 7
==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.
________________________________________________________________________
Testing with mask size = 9
==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.
________________________________________________________________________
==PROF== Disconnected from process 31551
[31551] t2140@127.0.0.1
convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 815.21
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 47,929
Memory [%] % 23.96
DRAM Throughput % 15.23
Duration usecond 42.08
L1/TEX Cache Throughput % 26.90
L2 Cache Throughput % 10.54
SM Active Cycles cycle 42,619.88
Compute (SM) [%] % 37.09
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 38
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1,048,576
Waves Per SM 12.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 1
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 40.42
Achieved Active Warps Per SM warp 25.87
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 771.98
SM Frequency cycle/nsecond 1.07
Elapsed Cycles cycle 31,103
Memory [%] % 40.61
DRAM Throughput % 24.83
Duration usecond 29.12
L1/TEX Cache Throughput % 46.39
L2 Cache Throughput % 18.43
SM Active Cycles cycle 27,168.03
Compute (SM) [%] % 60.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,156
Registers Per Thread register/thread 31
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 4.10
Threads thread 1,183,744
Waves Per SM 7.22
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 24
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 84.01
Achieved Active Warps Per SM warp 53.77
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
as well as across blocks of the same kernel.
<sections repeat for each kernel launch>
$
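The numbers in the two Occupancy sections above can be cross-checked by hand. The following Python sketch does so under a few assumptions that are not stated in the output: the profiled GPU holds at most 64 warps per SM (inferred from 32 theoretical warps being reported as 50%), it has 65536 registers per SM, and register allocation granularity is ignored. The helper names are hypothetical; the actual logic Nsight Compute uses lives in the Occupancy.py and Occupancy.section files.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64      # assumption: inferred from "32 warps == 50%" in the report
REGISTERS_PER_SM = 65536   # assumption: not shown in the output above


def theoretical_occupancy_pct(block_limits, block_size):
    """The smallest block limit decides how many blocks (and warps) fit on an SM."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    active_warps = min(min(block_limits) * warps_per_block, MAX_WARPS_PER_SM)
    return 100.0 * active_warps / MAX_WARPS_PER_SM


def register_block_limit(regs_per_thread, block_size):
    """Simplified register limit; real hardware rounds allocations up."""
    return REGISTERS_PER_SM // (regs_per_thread * block_size)


# Block limits in report order: SM, registers, shared memory, warps.
naive_theoretical = theoretical_occupancy_pct([32, 1, 32, 2], 1024)
tiled_theoretical = theoretical_occupancy_pct([32, 2, 24, 2], 1024)

# Achieved occupancy from the "Achieved Active Warps Per SM" rows.
naive_achieved = 100.0 * 25.87 / MAX_WARPS_PER_SM
tiled_achieved = 100.0 * 53.77 / MAX_WARPS_PER_SM

# Register-limited blocks per SM for 38 and 31 registers per thread.
naive_reg_limit = register_block_limit(38, 1024)
tiled_reg_limit = register_block_limit(31, 1024)

print(naive_theoretical, tiled_theoretical)
print(round(naive_achieved, 2), round(tiled_achieved, 2))
print(naive_reg_limit, tiled_reg_limit)
```

With these assumptions, the sketch reproduces the 50% and 100% theoretical values and the register block limits of 1 and 2. The achieved values agree with the report to within the rounding of the displayed warp counts.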
Upvotes: 1