Reputation: 9509
There is a loop in my code that iterates 100 million times (required for 100 million replications of a simulation model). For each of the 100 million iterations, I retrieve a value from an array (myarray) by indexing on an integer variable named age. Due to the length of the array, it is only valid to index myarray[age] for age = 0, ..., 99. However, the actual domain of age is 0, ..., inf.
So, I have the following function
int tidx(const int& a) {
return std::min(a,99);
}
which allows indexing by myarray[tidx(age)].
How could I do this more efficiently?
An example of building a source file that illustrates the compiler flags I'm using:
Building file: ../SAR.cpp
Invoking: GCC C++ Compiler
g++ -O3 -Wall -c -fmessage-length=0 -Wno-sign-compare -fopenmp -MMD -MP -MF"SAR.d" -MT"SAR.d" -o"SAR.o" "../SAR.cpp"
Finished building: ../SAR.cpp
From perf record followed by perf report:
Samples: 280 of event 'cycles', Event count (approx.): 179855989
24.78% pc2 libc-2.17.so [.] __GI_____strtod_l_internal
11.35% pc2 pc2 [.] samplePSA(int, double, int, NRRan&)
6.67% pc2 libc-2.17.so [.] str_to_mpn.isra.0
6.15% pc2 pc2 [.] simulate4_NEJMdisutilities(Policy&, bool)
5.68% pc2 pc2 [.] (anonymous namespace)::stateTransition(double const&, int const&, int&, double const&, bool const&, bool&, bo
5.25% pc2 pc2 [.] HistogramAges::add(double const&)
3.73% pc2 libstdc++.so.6.0.17 [.] std::istream::getline(char*, long, char)
3.02% pc2 libstdc++.so.6.0.17 [.] std::basic_istream<char, std::char_traits<char> >& std::operator>><char, std::char_traits<char> >(std::basic_
2.49% pc2 [kernel.kallsyms] [k] 0xffffffff81043e6a
2.29% pc2 libc-2.17.so [.] __strlen_sse2
2.00% pc2 libc-2.17.so [.] __mpn_lshift
1.72% pc2 libstdc++.so.6.0.17 [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::
1.71% pc2 libc-2.17.so [.] __memcpy_ssse3_back
1.67% pc2 libstdc++.so.6.0.17 [.] std::locale::~locale()
1.65% pc2 libc-2.17.so [.] __mpn_construct_double
1.38% pc2 libc-2.17.so [.] memchr
1.29% pc2 pc2 [.] (anonymous namespace)::readTransitionMatrix(double*, std::string)
1.27% pc2 libstdc++.so.6.0.17 [.] std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
1.15% pc2 libc-2.17.so [.] round_and_return
1.02% pc2 libc-2.17.so [.] __mpn_mul
1.01% pc2 libstdc++.so.6.0.17 [.] std::istream::sentry::sentry(std::istream&, bool)
1.00% pc2 libc-2.17.so [.] __memcpy_sse2
0.85% pc2 libstdc++.so.6.0.17 [.] std::locale::locale(std::locale const&)
0.85% pc2 libstdc++.so.6.0.17 [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)
0.83% pc2 libstdc++.so.6.0.17 [.] std::locale::locale()
0.73% pc2 libc-2.17.so [.] __mpn_mul_1
From perf stat:
Performance counter stats for './release/pc2':
62.449034 task-clock # 0.988 CPUs utilized
49 context-switches # 0.785 K/sec
3 cpu-migrations # 0.048 K/sec
861 page-faults # 0.014 M/sec
179,240,478 cycles # 2.870 GHz
58,909,298 stalled-cycles-frontend # 32.87% frontend cycles idle
<not supported> stalled-cycles-backend
320,437,960 instructions # 1.79 insns per cycle
# 0.18 stalled cycles per insn
70,932,710 branches # 1135.850 M/sec
697,468 branch-misses # 0.98% of all branches
0.063228446 seconds time elapsed
I would appreciate any comments. I need to learn how to interpret/read this information, so any tips that might help me get started would be appreciated.
Upvotes: 4
Views: 7571
Reputation: 40679
The first mistake lots of people make is to eyeball the code, home in on something or other, and wonder if they can make it faster.
The second mistake is to run gprof
on it, hoping to find a "bottleneck".
The only helpful thing gprof
can reliably find is if your code is CPU-bound, and it is CPU-bound in code that you compiled.
It is not good at finding function calls that could be done away with.
It is not good at finding I/O that you didn't know you were doing.
Plenty of people use this method, and here's why it works.
Upvotes: 3
Reputation: 88185
Remember to profile an executable with optimization turned on. For performance testing, running a non-optimized executable is useless; it will have completely different performance characteristics.
Then consider what you can do just to avoid so many lookups. Doing less work (algorithmic improvements) will take less time.
Also write code without "premature pessimization", as Herb Sutter likes to call it.
Definition: Premature pessimization is when you write code that is slower than it needs to be, usually by asking for unnecessary extra work, when equivalently complex code would be faster and should just naturally flow out of your fingers.
For example, inlining may or may not help, but you may want to write the code so as not to preclude inlining. Later you can force or prevent inlining and compare which is better for your execution environment. You also probably don't want to use references to small types like int. Absent optimization, a reference parameter will probably be implemented with a pointer, which today is usually larger and slower to work with than an int. Even on 32-bit hardware a reference won't be faster than a plain int.
int tidx(int a) {
return std::min(a,99);
}
Then you can try some other optimization techniques: do independent tasks run in parallel? Do your data structures have good locality of reference? If things do run in parallel, are you being affected by false sharing? Can you use SIMD or other data parallelization? You can also play with compiler settings or enable specific optimizations in particular parts of the code. This is where testing performance becomes really important, because different hardware can have radically different behavior. Also, since most of these kinds of optimizations obfuscate the code, you don't want to pay that price for little or nothing in return.
Upvotes: 2
Reputation: 248039
A couple of ideas to look into
But really, it is an extremely simple function. There's not much to be gained just by looking at its implementation. Most of my suggestions involve how the function is called, where it is called from, when it is called, and what data it is called on. Those are likely what you need to look at.
Upvotes: 2
Reputation: 129374
Good advice already given about profiling.
std::min<T>(T x, T y)
should be equivalent to return (x < y) ? x : y;
As assembler, that would become:
mov x, %eax
cmp y, %eax
cmovg y, %eax
or something along those lines. With -O2 enabled, gcc should inline the whole function down to these three instructions.
Upvotes: 2
Reputation:
In order to optimize the code one must first figure out where the bottleneck is. To find a bottleneck, the code must be profiled. Otherwise chances are that a lot of time will be wasted doing micro/false optimizations that don't matter at all.
I haven't used a profiler on your minimal working code example (which you haven't provided), but based on my experience I can tell you this: your tidx() function is not the bottleneck, and you should not be concerned with the performance of std::min() in this case. It is a lot more likely that the bottleneck is memory access and stalled CPU cycles.
For starters, try unrolling your loop if possible (and if the compiler hasn't already done that for you). It might be more efficient to perform 25000000 iterations rather than 100000000, but do more in a single loop cycle. But before you do that, you must make sure that unrolling the loop helps rather than hurts. That is usually done by profiling, so we get back to the point that in order to optimize the code one must first figure out where the bottleneck is. To find a bottleneck... Oh wait, I almost got into an infinite loop here. Aborting.
Upvotes: 11