Reputation: 1301
I am trying to get profiling data for cuFFT library calls, for example plan and exec. I am using nvprof (the command-line profiling tool) with the "--print-api-trace" option. It prints the time for all the APIs except the cuFFT APIs. Is there any flag I need to change to get the cuFFT profiling data? Or do I need to use events and measure it myself?
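In case it matters, this is roughly what I mean by measuring with events myself; a minimal sketch, where the FFT size and the buffer are just placeholders:
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1 << 20;                      // placeholder FFT size
    cufftComplex *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(cufftComplex));

    // Plan creation runs on the host, so a host timer is enough here
    auto t0 = std::chrono::high_resolution_clock::now();
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    auto t1 = std::chrono::high_resolution_clock::now();
    double planMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // The transform runs on the device, so bracket it with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float execMs = 0.0f;
    cudaEventElapsedTime(&execMs, start, stop);

    printf("plan: %.3f ms, exec: %.3f ms\n", planMs, execMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}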
Upvotes: 0
Views: 317
Reputation: 11
NVIDIA's Nsight Systems (nsys) and Nsight Compute (ncu) are the newer tools you want to look at. Both come in GUI and CLI versions.
Nsight Systems performs system-wide profiling with low overhead, letting developers view performance statistics and identify bottlenecks. The statistics include host-to-device memory transfers, kernel runtimes, and stream and device synchronization times.
Run this to start a profiling session:
nsys profile --output <report-output-file> --gpu-metrics-devices=all <your-executable(s)>
After the profiling session, a .nsys-rep file will be generated and available for analysis. You can either import it into the GUI, or run this command:
nsys stats <your-.nsys-rep-file>
Here's a snippet of the output:
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------ ----------------------------------------------------------------------------------------------------
70.1 4,178,454,702 72 58,034,093.1 57,898,957.5 1,244,084 265,042,382 52,417,885.5 fft_step_2_2(Complex *, Complex *, int, int, bool)
17.3 1,033,805,010 72 14,358,402.9 13,011,089.0 12,910,621 40,751,504 5,340,200.6 calculate_w(Complex *, int)
3.8 228,587,479 72 3,174,826.1 1,063,724.5 1,056,748 152,983,858 17,903,830.7 copy_first_half_2(Complex *, int)
2.0 118,304,890 72 1,643,123.5 1,553,585.0 1,251,029 2,656,751 414,602.5 fft_step_2(Complex *, Complex *, int, int, bool)
1.9 113,549,795 2 56,774,897.5 56,774,897.5 38,970,469 74,579,326 25,179,264.3 copyDoubleToComplex(double *, Complex *, int)
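If you only care about one of these reports, nsys stats can, as far as I know, be limited to it with --report, e.g. for the GPU kernel summary shown above:
nsys stats --report cuda_gpu_kern_sum <your-.nsys-rep-file>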
Compared to Nsight Systems, Nsight Compute benchmarks at a much lower level: cache hits/misses, memory-access statistics, block utilization, and so on. Because of that, it incurs a lot of overhead, and each profiling session is slower (a way to limit this is shown after the sample output below).
To start a profiling session:
ncu -o <report-output-file> <your-executable>
After the session has finished, a .ncu-rep file will be generated. You can either import it into the GUI program, or extract the stats with ncu -i:
ncu -i <your-.ncu-rep-file>
Sample output:
complexMultiplyKernel(double2 *, double2 *, int) (256, 1, 1)x(32, 1, 1), Context 1, Stream 14, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 6.73
SM Frequency Ghz 1.36
Elapsed Cycles cycle 2,107,144
Memory Throughput % 80.69
DRAM Throughput % 80.69
Duration ms 1.54
L1/TEX Cache Throughput % 14.76
L2 Cache Throughput % 28.73
SM Active Cycles cycle 2,043,466.81
Compute (SM) Throughput % 23.53
----------------------- ----------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 32
Function Cache Configuration CachePreferNone
Grid Size 256
Registers Per Thread register/thread 28
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 68
Threads thread 8,192
Uses Green Context 0
Waves Per SM 0.24
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 64
Block Limit Shared Mem block 16
Block Limit Warps block 32
Theoretical Active Warps per SM warp 16
Theoretical Occupancy % 50
Achieved Occupancy % 11.83
Achieved Active Warps Per SM warp 3.78
------------------------------- ----------- ------------
OPT Est. Local Speedup: 76.35%
The difference between calculated theoretical (50.0%) and measured achieved occupancy (11.8%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
----- --------------------------------------------------------------------------------------------------------------
OPT Est. Local Speedup: 50%
The 4.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
hardware maximum of 8. This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that
can fit on the SM. This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared
memory.
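Because a full ncu session is slow, it can help to profile only the kernels you care about. As far as I remember, ncu supports filtering by kernel name and limiting the number of profiled launches; the regex below is just an example matching the fft_step kernels from earlier:
ncu -o <report-output-file> --kernel-name regex:fft_step --launch-count 1 <your-executable>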
Upvotes: 1
Reputation: 151799
According to the nvprof documentation, api-trace-mode:
API-trace mode shows the timeline of all CUDA runtime and driver API calls
cuFFT is neither the CUDA runtime API nor the CUDA driver API. It is a separate library of FFT routines, with its own documentation.
You can still use either nvprof, the command line profiler, or the visual profiler, to gather data about how cuFFT uses the GPU, of course.
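For example, running nvprof in its default mode, or in GPU-trace mode, will show the kernels and memory copies that cuFFT launches under the hood (the executable name is just a placeholder):
nvprof ./my_cufft_app
nvprof --print-gpu-trace ./my_cufft_app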
Upvotes: 5
Reputation: 1301
Got it working. Instead of using nvprof, I used the CUDA_PROFILE environment variable.
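If I remember correctly, that amounts to something like this (the executable name is a placeholder, and the log file name is the profiler's default):
export CUDA_PROFILE=1
./my_cufft_app
cat cuda_profile_0.log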
Upvotes: -2