Reputation: 234
I am trying to use gprof to profile some numerical code I am developing, but gprof seems to fail to collect data from my program. Here is my command line:
g++ -Wall -O3 -g -pg -o fftw_test fftw_test.cpp -lfftw3 -lfftw3_threads -lm && ./fftw_test
The gmon.out file is created, but it seems to have no data. When i run
gprof -b fftw_test gmon.out > gprof.out
All I get is
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time self children called name
Index by function name
Any insights?
The code does a lot of stuff, it does not simply call FFTW routines. It has functions that compute certain complex coefficients, functions that multiply input data by these coefficients, and so on.
Edit.: Including example code and results.
#include <cstdlib>
#include <ctime>
int main()
{
std::srand( std::time( 0 ) );
double sum = 0.0;
for ( int i = 0; i < RAND_MAX; ++i )
sum += std::rand() / ( double ) RAND_MAX;
std::cout << sum << '\n';
return 0;
}
Command lines:
$ g++ -Wall -O3 -g -pg -o gprof_test gprof_test.cpp && ./gprof_test
1.07374e+09
$ gprof -b gprof_test gmon.out > gprof.out
$ cat gprof.out
Result:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
Call graph
granularity: each sample hit covers 2 byte(s) no time propagated
index % time self children called name
Index by function name
And that's it.
Upvotes: 9
Views: 8291
Reputation: 369
I presume the problem comes from the fact you are using O3
level of optimisation. With gcc-8.4.0 I get nothing with O3
, limited data (e.g. number of function calls missing) with O2
and proper profile for O1
and O0
.
This seems to have been a known bug with older versions of gcc, yet I did not come across any sources regarding such a problem with later versions. I can only hypothesize whether it is a bug with the compiler or whether more aggressive optimisations prevent some performance data from being collected.
Upvotes: 1
Reputation: 4348
If you're using gcc 6, you're most likely running into this bug (note that the bug is not specific to Debian but depends on how gcc was built). A workaround is simply to compile using the `-no-pie' option which disables position-independent code generation.
This is a good start if you want to know more about PIE.
Upvotes: 8
Reputation: 94445
gprof seems to fail to collect data from my program. Here is my command line:
g++ -Wall -O3 -g -pg -o fftw_test fftw_test.cpp -lfftw3 -lfftw3_threads -lm && ./fftw_test
Your program uses fftw library and probably consist almost only of fftw library calls. What is the running time? Your program may be too fast to be profiled with gprof. Update And library may be not seen by gprof as it was compiled without gprof profiling enabled.
GNU gprof has two parts. First, it instruments function calls in c/cpp files which were compiled with -pg
option (with mcount function calls - https://en.wikipedia.org/wiki/Gprof) - to get caller/callee info. Second, it links additional profiling library into your executable to add periodic sampling to find which code was executed for more time. Sampling is done with profil (setitimer). Setitimer profiling has limited resolution and can't resolve intervals less than 10 ms or 1 ms (100 or 1000 samples per second).
And in your example, the fftw library was probably compiled without instrumentation, so no mcount
calls in it. It still can be captured by sampling part, but only for main thread of the program (https://en.wikipedia.org/wiki/Gprof - "typically it only profiles the main thread of application").
perf
profiler have no instrumentation with mcount
(it gets callee/caller from stack unwinding when recorded with -g
option), but it have much better statistics/sampling variants (it can use hardware PMU counters), without 100 or 1000 Hz limit, and it supports (profiles) threads correctly. Try perf record -F1000 ./fftw_test
(with 1 kHz sampling frequency) and perf report
or perf report > report.txt
. There are some GUI/HTML frontends to perf too: https://github.com/KDAB/hotspot https://github.com/jrfonseca/gprof2dot
For better setitimer style profiler check google-perftools https://github.com/gperftools/gperftools for "CPU PROFILER".
======
With your test I have some gprof results on Debian 8.6 Linux kernel version 3.16.0-4-amd64 x86_64 machine, g++ (Debian 4.9.2-10), gprof is "GNU gprof (GNU Binutils for Debian) 2.27"
$ cat gprof_test.cpp
#include <cstdlib>
#include <ctime>
#include <iostream>
int main()
{
std::srand( std::time( 0 ) );
double sum = 0.0;
for ( int i = 0; i < 100000000; ++i )
sum += std::rand() / ( double ) RAND_MAX;
std::cout << sum << '\n';
return 0;
}
$ g++ -Wall -O3 -g -pg -o gprof_test gprof_test.cpp && time ./gprof_test
5.00069e+06
real 0m0.992s
$ gprof -b gprof_test gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
So, gprof did not catch any time samples in this 1 second example and have no information about calls into the libraries (they were compiled without -pg
). After adding some wrapper functions and prohibiting inline optimization I have some data from gprof but library time was not accounted (it sees 0.72 seconds of 2 second runtime):
$ cat *cpp
#include <cstdlib>
#include <ctime>
#include <iostream>
int rand_wrapper1()
{
return std::rand();
}
int rand_scale1()
{
return rand_wrapper1() / ( double ) RAND_MAX;
}
int main()
{
std::srand( std::time( 0 ) );
double sum = 0.0;
for ( int i = 0; i < 100000000; ++i )
sum+= rand_scale1();
// sum += std::rand() / ( double ) RAND_MAX;
std::cout << sum << '\n';
return 0;
}
$ g++ -Wall -O3 -fno-inline -g -pg -o gprof_test gprof_test.cpp && time ./gprof_test
real 0m2.345s
$ gprof -b gprof_test gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
80.02 0.57 0.57 rand_scale1()
19.29 0.71 0.14 100000000 1.37 1.37 rand_wrapper1()
2.14 0.72 0.02 frame_dummy
0.00 0.72 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z13rand_wrapper1v
0.00 0.72 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) [clone .constprop.0]
Call graph
granularity: each sample hit covers 2 byte(s) for 1.39% of 0.72 seconds
index % time self children called name
<spontaneous>
[1] 97.9 0.57 0.14 rand_scale1() [1]
0.14 0.00 100000000/100000000 rand_wrapper1() [2]
-----------------------------------------------
0.14 0.00 100000000/100000000 rand_scale1() [1]
[2] 19.0 0.14 0.00 100000000 rand_wrapper1() [2]
And perf sees all parts:
$ perf record ./gprof_test
0
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.388 MB perf.data (~16954 samples) ]
$ perf report |more
# Samples: 9K of event 'cycles'
# Event count (approx.): 7373484231
#
# Overhead Command Shared Object Symbol
# ........ .......... ................. .........................
#
25.91% gprof_test gprof_test [.] rand_scale1()
21.65% gprof_test libc-2.19.so [.] __mcount_internal
13.88% gprof_test libc-2.19.so [.] _mcount
12.54% gprof_test gprof_test [.] main
9.35% gprof_test libc-2.19.so [.] __random_r
8.40% gprof_test libc-2.19.so [.] __random
3.97% gprof_test gprof_test [.] rand_wrapper1()
2.79% gprof_test libc-2.19.so [.] rand
1.41% gprof_test gprof_test [.] mcount@plt
0.03% gprof_test [kernel.kallsyms] [k] memset
Upvotes: 0