Reputation: 11
The architecture is something like this:
python ----[call]----> TensorFlow (GPU build) ----[call]----> CUDA SDK ----[call]----> GPU binary that executes the job, or something like that.
I have tried using nvvp to analyze the python script directly. The result cost me 4.6 GB of memory, and the nvvp GUI froze. So basically I have no idea how to proceed.
Is there a way to find out exactly which CUDA APIs the whole program calls? This problem is not limited to TensorFlow; I need a general method, so that I can later test all the related APIs and decide which GPU is suitable for our program.
Upvotes: 1
Views: 229
Reputation: 72339
The simplest way to do this is to use the command line profiling tool nvprof with the API summary option, like this:
$ nvprof --print-api-summary ./a.out
3 6 7 5 3 5 6 2 9
1 2 7 0 9 3 6 0 6
2 6 1 8 7 9 2 0 2
3 7 5 9 2 2 8 9 7
==18840== NVPROF is profiling process 18840, command: ./a.out
0: 3.000000 5.000000 6.000000 2.000000
1: 9.000000 3.000000 6.000000 0.000000
2: 7.000000 9.000000 2.000000 0.000000
3: 2.000000 2.000000 8.000000 9.000000
3 6 7 5 3 3.142 6 2 9
1 2 7 0 9 3.142 6 0 6
2 6 1 8 7 3.142 2 0 2
3 7 5 9 2 3.142 8 9 7
==18840== Profiling application: ./a.out
==18840== Profiling result:
==18840== API calls:
Time(%) Time Calls Avg Min Max Name
41.57% 117.77ms 1 117.77ms 117.77ms 117.77ms cudaMallocPitch
31.45% 89.096ms 1 89.096ms 89.096ms 89.096ms cudaFree
26.61% 75.398ms 1 75.398ms 75.398ms 75.398ms cudaDeviceReset
0.14% 390.33us 1 390.33us 390.33us 390.33us cudaLaunch
0.09% 252.51us 91 2.7740us 247ns 98.999us cuDeviceGetAttribute
0.08% 225.51us 1 225.51us 225.51us 225.51us cuDeviceTotalMem
0.04% 101.02us 1 101.02us 101.02us 101.02us cudaDeviceSynchronize
0.02% 43.777us 2 21.888us 21.009us 22.768us cudaMemcpy2D
0.01% 32.867us 1 32.867us 32.867us 32.867us cuDeviceGetName
0.00% 4.1070us 4 1.0260us 188ns 3.2290us cudaSetupArgument
0.00% 3.3560us 3 1.1180us 332ns 2.4330us cuDeviceGetCount
0.00% 2.1280us 3 709ns 265ns 1.2330us cuDeviceGet
0.00% 1.2200us 1 1.2200us 1.2200us 1.2200us cudaConfigureCall
0.00% 885ns 1 885ns 885ns 885ns cudaPeekAtLastError
This shows all the driver and runtime API calls that the program executed over the life of the CUDA context associated with the application.
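Since the program in the question is launched through Python, it may help to note that nvprof can wrap any executable, including the Python interpreter itself; a sketch of the invocation (the script name here is hypothetical) would be:

$ nvprof --print-api-summary python my_tf_script.py

nvprof should then report the driver and runtime API calls made by any CUDA context created during the run, whether they come from TensorFlow or from any other library the script loads, which makes this a general method rather than a TensorFlow-specific one.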
Upvotes: 1