Sagar Masuti

Reputation: 1301

cuFFT profiling issue

I am trying to get profiling data for cuFFT library calls, for example plan and exec. I am using nvprof (the command-line profiling tool) with the "--print-api-trace" option. It prints the time for all the APIs except the cuFFT APIs. Is there a flag I need to set to get the cuFFT profiling data? Or do I need to use events and measure it myself?
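For reference, the "measure it myself" route can look like the sketch below (not from the thread; the problem size is a placeholder and error checking is omitted). Plan creation runs largely on the host, so a CPU clock is used for it, while the asynchronous exec is bracketed with CUDA events:

```cpp
// Minimal sketch: timing cuFFT plan and exec manually (assumes CUDA + cuFFT installed).
#include <cuda_runtime.h>
#include <cufft.h>
#include <chrono>
#include <cstdio>

int main() {
    const int N = 1 << 20;                      // placeholder signal length
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N);

    // Plan creation is mostly host-side work, so time it with a CPU clock.
    auto t0 = std::chrono::steady_clock::now();
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    auto t1 = std::chrono::steady_clock::now();
    double planMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // The exec launches asynchronous GPU work, so bracket it with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float execMs = 0.0f;
    cudaEventElapsedTime(&execMs, start, stop);

    printf("plan: %.3f ms, exec: %.3f ms\n", planMs, execMs);

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```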

Upvotes: 0

Views: 317

Answers (3)

Lowie

Reputation: 11

NVIDIA's Nsight Systems (nsys) and Nsight Compute (ncu) are the newer tools you want to look at. Both are available in GUI and CLI versions.

Nsight Systems

This tool performs system-wide profiling with low overhead, letting developers view performance statistics and identify bottlenecks. The statistics include memory transfers between host and device, kernel runtimes, and stream and device synchronization times.

Run this to start a profiling session:

nsys profile --output <report-output-file> --gpu-metrics-devices=all <your-executable(s)>

After a profiling session, a .nsys-rep report file will be generated and available for analysis. You can either import it into the GUI, or run this command:

nsys stats <your-.nsys-rep-file>

Here's the snippet of the output:

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)                                                   Name                                                
--------  ---------------  ---------  ------------  ------------  ----------  -----------  ------------  ----------------------------------------------------------------------------------------------------
    70.1    4,178,454,702         72  58,034,093.1  57,898,957.5   1,244,084  265,042,382  52,417,885.5  fft_step_2_2(Complex *, Complex *, int, int, bool)                                                  
    17.3    1,033,805,010         72  14,358,402.9  13,011,089.0  12,910,621   40,751,504   5,340,200.6  calculate_w(Complex *, int)                                                                         
     3.8      228,587,479         72   3,174,826.1   1,063,724.5   1,056,748  152,983,858  17,903,830.7  copy_first_half_2(Complex *, int)                                                                   
     2.0      118,304,890         72   1,643,123.5   1,553,585.0   1,251,029    2,656,751     414,602.5  fft_step_2(Complex *, Complex *, int, int, bool)                                                    
     1.9      113,549,795          2  56,774,897.5  56,774,897.5  38,970,469   74,579,326  25,179,264.3  copyDoubleToComplex(double *, Complex *, int)

Nsight Compute

Compared to Nsight Systems, this tool profiles at a much lower level: cache hits and misses, memory access statistics, block utilization, and so on. Because of that, it incurs significant overhead, and each profiling session is slower.

To start a profiling session:

ncu -o <report-output-file> <your-executable>

After the session has finished, a .ncu-rep file will be generated. You can either import it into the GUI program, or extract the stats with ncu -i:

ncu -i <your-.ncu-rep-file>

Sample output:

complexMultiplyKernel(double2 *, double2 *, int) (256, 1, 1)x(32, 1, 1), Context 1, Stream 14, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         6.73
    SM Frequency                    Ghz         1.36
    Elapsed Cycles                cycle    2,107,144
    Memory Throughput                 %        80.69
    DRAM Throughput                   %        80.69
    Duration                         ms         1.54
    L1/TEX Cache Throughput           %        14.76
    L2 Cache Throughput               %        28.73
    SM Active Cycles              cycle 2,043,466.81
    Compute (SM) Throughput           %        23.53
    ----------------------- ----------- ------------

    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To   
          further improve performance, work will likely need to be shifted from the most utilized to another unit.      
          Start by analyzing DRAM in the Memory Workload Analysis section.                                              

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    32
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                    256
    Registers Per Thread             register/thread              28
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              68
    Threads                                   thread           8,192
    Uses Green Context                                             0
    Waves Per SM                                                0.24
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           64
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           32
    Theoretical Active Warps per SM        warp           16
    Theoretical Occupancy                     %           50
    Achieved Occupancy                        %        11.83
    Achieved Active Warps Per SM           warp         3.78
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 76.35%                                                                                    
          The difference between calculated theoretical (50.0%) and measured achieved occupancy (11.8%) can be the      
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can   
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices   
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on     
          optimizing occupancy.                                                                                         
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Local Speedup: 50%                                                                                       
          The 4.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the       
          hardware maximum of 8. This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that    
          can fit on the SM. This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared    
          memory.                                                                                                       

Upvotes: 1

Robert Crovella

Reputation: 151799

According to the nvprof documentation, api-trace-mode:

API-trace mode shows the timeline of all CUDA runtime and driver API calls

cuFFT is neither the CUDA runtime API nor the CUDA driver API. It is a library of FFT routines with its own separate documentation.

You can still use nvprof, the (legacy) command-line profiler, or the Visual Profiler to gather data about how cuFFT uses the GPU, of course.
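For example (a sketch; the application name is a placeholder and exact output depends on the toolkit version), running the application under nvprof in its default summary mode, or with --print-gpu-trace, shows the kernels and memory copies that cuFFT launches internally:

```shell
# Summary mode: aggregates GPU time per kernel, including cuFFT's internal kernels
nvprof ./your_cufft_app

# GPU-trace mode: one line per kernel launch / memcpy, in timeline order
nvprof --print-gpu-trace ./your_cufft_app
```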

Upvotes: 5

Sagar Masuti

Reputation: 1301

Got it working. Instead of using nvprof, I used the CUDA_PROFILE environment variable.
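For anyone following this route, a sketch of how the legacy environment-variable-driven profiler was used (the application name is a placeholder; this mechanism only exists in older CUDA toolkits and was removed in later releases):

```shell
# Legacy command-line profiler (older CUDA toolkits only)
export CUDA_PROFILE=1                      # enable profiling for the next run
export CUDA_PROFILE_LOG=cuda_profile.log   # optional: override the default log file name
./your_cufft_app                           # placeholder application name
cat cuda_profile.log                       # per-kernel and memcpy timings
```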

Upvotes: -2
