synaptik

Reputation: 9509

Efficiency of std::min(int) in c++

There is a loop in my code that iterates 100 million times (required for 100 million replications of a simulation model). For each of the 100 million iterations, I retrieve a value from an array (myarray) by indexing on an integer variable named age. Due to the length of the array, it is only valid to index myarray[age] for age=0,...,99. However, the actual domain of age is 0,...,inf.

So, I have the following function

int tidx(const int& a) {
    return std::min(a,99);
}

which allows indexing by myarray[tidx(age)].

How could I do this more efficiently?


An example of building a source file that illustrates the compiler flags I'm using:

Building file: ../SAR.cpp
Invoking: GCC C++ Compiler
g++ -O3 -Wall -c -fmessage-length=0 -Wno-sign-compare -fopenmp -MMD -MP -MF"SAR.d" -MT"SAR.d" -o"SAR.o" "../SAR.cpp"
Finished building: ../SAR.cpp

From perf record followed by perf report:

Samples: 280  of event 'cycles', Event count (approx.): 179855989                                                                                   
 24.78%  pc2  libc-2.17.so         [.] __GI_____strtod_l_internal
 11.35%  pc2  pc2                  [.] samplePSA(int, double, int, NRRan&)
  6.67%  pc2  libc-2.17.so         [.] str_to_mpn.isra.0
  6.15%  pc2  pc2                  [.] simulate4_NEJMdisutilities(Policy&, bool)
  5.68%  pc2  pc2                  [.] (anonymous namespace)::stateTransition(double const&, int const&, int&, double const&, bool const&, bool&, bo
  5.25%  pc2  pc2                  [.] HistogramAges::add(double const&)
  3.73%  pc2  libstdc++.so.6.0.17  [.] std::istream::getline(char*, long, char)
  3.02%  pc2  libstdc++.so.6.0.17  [.] std::basic_istream<char, std::char_traits<char> >& std::operator>><char, std::char_traits<char> >(std::basic_
  2.49%  pc2  [kernel.kallsyms]    [k] 0xffffffff81043e6a
  2.29%  pc2  libc-2.17.so         [.] __strlen_sse2
  2.00%  pc2  libc-2.17.so         [.] __mpn_lshift
  1.72%  pc2  libstdc++.so.6.0.17  [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::
  1.71%  pc2  libc-2.17.so         [.] __memcpy_ssse3_back
  1.67%  pc2  libstdc++.so.6.0.17  [.] std::locale::~locale()
  1.65%  pc2  libc-2.17.so         [.] __mpn_construct_double
  1.38%  pc2  libc-2.17.so         [.] memchr
  1.29%  pc2  pc2                  [.] (anonymous namespace)::readTransitionMatrix(double*, std::string)
  1.27%  pc2  libstdc++.so.6.0.17  [.] std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
  1.15%  pc2  libc-2.17.so         [.] round_and_return
  1.02%  pc2  libc-2.17.so         [.] __mpn_mul
  1.01%  pc2  libstdc++.so.6.0.17  [.] std::istream::sentry::sentry(std::istream&, bool)
  1.00%  pc2  libc-2.17.so         [.] __memcpy_sse2
  0.85%  pc2  libstdc++.so.6.0.17  [.] std::locale::locale(std::locale const&)
  0.85%  pc2  libstdc++.so.6.0.17  [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)
  0.83%  pc2  libstdc++.so.6.0.17  [.] std::locale::locale()
  0.73%  pc2  libc-2.17.so         [.] __mpn_mul_1

From perf stat:

 Performance counter stats for './release/pc2':

         62.449034 task-clock                #    0.988 CPUs utilized          
                49 context-switches          #    0.785 K/sec                  
                 3 cpu-migrations            #    0.048 K/sec                  
               861 page-faults               #    0.014 M/sec                  
       179,240,478 cycles                    #    2.870 GHz                    
        58,909,298 stalled-cycles-frontend   #   32.87% frontend cycles idle   
   <not supported> stalled-cycles-backend  
       320,437,960 instructions              #    1.79  insns per cycle        
                                             #    0.18  stalled cycles per insn
        70,932,710 branches                  # 1135.850 M/sec                  
           697,468 branch-misses             #    0.98% of all branches        

       0.063228446 seconds time elapsed

I would appreciate any comments. I need to learn how to interpret/read this information, so any tips that might help me get started would be appreciated.

Upvotes: 4

Views: 7571

Answers (5)

Mike Dunlavey

Reputation: 40679

The first mistake lots of people make is to eyeball the code, home in on something or other, and wonder if they can make it faster.

The second mistake is to run gprof on it, hoping to find a "bottleneck".

The only helpful thing gprof can reliably tell you is whether your code is CPU-bound, and CPU-bound in code that you compiled. It is not good at finding function calls that could be eliminated. It is not good at finding I/O that you didn't know you were doing.

Plenty of people instead use the simple method of pausing the program at random under a debugger and examining the call stack, and it works because it shows you directly what the program is actually spending its time doing.

Upvotes: 3

bames53

Reputation: 88185

Remember to profile an executable with optimization turned on. For performance testing running a non-optimized executable is useless, as it will have completely different performance characteristics.

Then consider what you can do just to avoid so many lookups. Doing less work (algorithmic improvements) will take less time.
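As a concrete (hypothetical) sketch of "doing less work": if the same age is looked up many times in a row, the clamp can be hoisted out of the inner loop and computed once. The function name sumForAge and the reps parameter are illustrative, not from the original code.

```cpp
#include <algorithm>

// Hypothetical sketch: when the same age is used for many consecutive
// lookups, clamp it once outside the loop instead of on every access.
double sumForAge(const double* myarray, int age, int reps) {
    const int idx = std::min(age, 99);  // one clamp, hoisted out of the loop
    double total = 0.0;
    for (int r = 0; r < reps; ++r)
        total += myarray[idx];          // no per-iteration std::min
    return total;
}
```

Whether this applies depends on the structure of the simulation; the point is that removing work algorithmically beats micro-tuning the clamp itself.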

Also, write code without 'premature pessimization', as Herb Sutter likes to call it.

Definition: Premature pessimization is when you write code that is slower than it needs to be, usually by asking for unnecessary extra work, when equivalently complex code would be faster and should just naturally flow out of your fingers.

For example, inlining may or may not help, but you may want to write the code so as not to preclude inlining. Later you can force or prevent inlining and compare which is better for your execution environment. You also probably don't want to take small types like int by reference: absent optimization, a reference parameter will probably be implemented with a pointer, which today is usually larger and slower than an int. Even on 32-bit hardware a reference won't be faster than an int.

int tidx(int a) {  // by value: copying an int is at least as cheap as a reference
    return std::min(a, 99);
}

Then you can try some other optimization techniques: do independent tasks run in parallel? Do your data structures have good locality of reference? If things do run in parallel, are you being affected by false sharing? Can you use SIMD or other data parallelism? You can also play with compiler settings or enable specific optimizations in particular parts of the code. This is where performance testing becomes really important, because different hardware can have radically different behavior. And since most of these kinds of optimizations obfuscate the code, you don't want to pay that price for little or nothing in return.
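On the parallelism point: the question's build line already passes -fopenmp, so one hedged sketch, assuming the replications are genuinely independent (no shared mutable state), is an OpenMP reduction over them. The function name runReplications and the way age is derived here are stand-ins, not the original model.

```cpp
// Hedged sketch: replications are assumed independent. Each thread
// accumulates into its own partial sum via the OpenMP reduction clause.
// Without -fopenmp the pragma is ignored and this runs serially.
double runReplications(const double* myarray, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        int age = i % 200;                      // stand-in for the model's age
        total += myarray[age < 99 ? age : 99];  // clamped lookup
    }
    return total;
}
```

With 100 million independent replications, this kind of coarse-grained parallelism usually dwarfs anything gained by tuning tidx().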

Upvotes: 2

Stack Overflow is garbage

Reputation: 248039

A couple of ideas to look into

  • check the generated assembly for the function. It may or may not compile to something suboptimal
  • call the function fewer times
  • rewrite the function to use SSE instructions to compute the minimum for 4 (or 8 with AVX) values at a time
  • restructure how the function is called. Ideally, the calls shouldn't be so far apart that the code gets evicted from the instruction cache.
  • don't pass the int by const reference. What's the point in that? Just pass a copy of the int.
  • examine whether any aliasing or false sharing might slow you down at the call site.

But really, it is an extremely simple function. There's not much to be gained just by looking at its implementation. Most of my suggestions involve how the function is called, where it is called from, when it is called, and what data it is called on. Those are likely what you need to look at.
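On the SIMD suggestion, a hedged sketch: rather than hand-writing intrinsics, one option is to clamp a whole batch of ages in a plain loop, which is exactly the shape gcc's auto-vectorizer turns into packed-min SSE/AVX instructions at -O3 (verifiable with -fopt-info-vec or by reading the output of g++ -O3 -S). The function name clampAges is illustrative.

```cpp
#include <algorithm>

// Clamp a batch of ages into valid indices in one pass. At -O3 gcc
// typically vectorizes this loop into packed-minimum instructions,
// processing 4 (SSE) or 8 (AVX2) ints per instruction.
void clampAges(const int* ages, int* idx, int n) {
    for (int i = 0; i < n; ++i)
        idx[i] = std::min(ages[i], 99);
}
```

This only pays off if the call sites can be restructured to produce ages in batches, which ties back to the point that the gains are in how the function is called, not in its body.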

Upvotes: 2

Mats Petersson

Reputation: 129374

Good advice already given about profiling.

min<T>(T x, T y) should be equivalent to return (x < y) ? x : y;

As assembler, that would become:

mov    x, %eax
cmp    y, %eax
cmovg  y, %eax

or something along those lines. With -O2 enabled in gcc, the function should certainly be inlined down to just these three instructions.
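The claimed equivalence is easy to sanity-check in code (and the generated instructions can be inspected with g++ -O2 -S). The two function names below are illustrative:

```cpp
#include <algorithm>

// Two spellings of the same clamp; both compile to the branchless
// mov/cmp/cmov sequence shown above at -O2. Compare their assembly
// with: g++ -O2 -S this_file.cpp
int tidx_min(int a)     { return std::min(a, 99); }
int tidx_ternary(int a) { return (a < 99) ? a : 99; }
```

Since both forms produce identical code, there is nothing to gain by rewriting std::min by hand.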

Upvotes: 2

user405725

Reputation:

In order to optimize the code, one must first figure out where the bottleneck is. To find a bottleneck, the code must be profiled. Otherwise, chances are that a lot of time will be wasted on micro-/false optimizations that don't matter at all.

I haven't used a profiler on your minimal working code example (which you haven't provided), but based on my experience I can tell you this: your tidx() function is not a bottleneck, and you should not be concerned with the performance of std::min() in this case. It is a lot more likely that the bottleneck is memory access and stalled CPU cycles.

For starters, try unrolling your loop if possible (and if the compiler hasn't already done that for you). It might be more efficient to perform 25,000,000 iterations rather than 100,000,000, but do more in each loop cycle. Before you do that, however, you must make sure that unrolling the loop helps rather than hurts. That is usually done by profiling, which brings us back to the point that in order to optimize the code one must first figure out where the bottleneck is. To find a bottleneck... Oh, wait, I almost got into an infinite loop here. Aborting.
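A hedged sketch of what manual 4-way unrolling might look like here (the function name, the ages array, and the accumulation are illustrative stand-ins for the real simulation body; whether this beats what -O3 already does must be checked by profiling):

```cpp
// 4-way manual unroll: 25,000,000 loop cycles instead of 100,000,000,
// with four independent accumulators so the additions can overlap.
double sumUnrolled(const double* myarray, const int* ages, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += myarray[ages[i]     < 99 ? ages[i]     : 99];
        s1 += myarray[ages[i + 1] < 99 ? ages[i + 1] : 99];
        s2 += myarray[ages[i + 2] < 99 ? ages[i + 2] : 99];
        s3 += myarray[ages[i + 3] < 99 ? ages[i + 3] : 99];
    }
    for (; i < n; ++i)  // remainder iterations
        s0 += myarray[ages[i] < 99 ? ages[i] : 99];
    return s0 + s1 + s2 + s3;
}
```
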

Upvotes: 11
