user3810155

Better not trust the gcc default inliner?

I was doing some hand optimization of my code and somehow got bitten by gcc.

The original code when run through a test takes about 3.5 seconds to finish execution.

I was confused why my optimized version now took about 4.3 seconds to finish the same test.

I applied __attribute__((always_inline)) to one of the local static functions that stuck out in the profiler, and now it proudly runs in 2.9 seconds. Nice.

I've always trusted gcc to make the decision on function inlining, but apparently it isn't all that perfect. I don't understand how gcc ended up with such a wrong decision about whether or not to inline a file-scope static function with -O3 -flto -fwhole-program. Is the compiler really just doing a guesstimate of the cost-benefit of inlining a function?

Upvotes: 1

Views: 127

Answers (1)

Mats Petersson

Reputation: 129524

Edit: To answer the ACTUAL question: yes, the compiler does indeed "guesstimate" - or, to use the technical term, it uses "heuristics" - to determine the gain in speed vs. space that inlining a particular function will result in. A heuristic is "a practical but not theoretically perfect solution". End Edit.

Without seeing the code it's hard to say what is going on in the compiler. You are doing the right thing to profile your code, try your hand-optimisations and profile again - if it's better, keep it!

It's not that unusual for compilers to get it wrong from time to time. Humans are more clever at times - but I would generally trust the compiler to get it right. It could be that the function is called many times and is rather large, and thus the compiler decides "it's above the threshold for code-bloat vs. speed gain". Or it could be that it just doesn't get the "how much better/worse is it to inline" computation right.

Remember that the compiler is meant to be generic, and what works for one case may make another case worse - so the compiler has to compromise and come up with reasonable heuristics that don't give too bad results too often.

If you can run profile-guided optimisation, it may help the compiler make the right decision (as it will know how many iterations and how often a particular branch is taken)...

If you can share the code with the GCC compiler team, report it as a bug - they may ignore/reject it as "too special" or some such, but it's quite possible that this particular case is something "that got missed out".

I think it's fair to say that the compiler "gets it right more often than not", but that doesn't mean it ALWAYS gets it right. I recently looked at some code generated by Clang, and it emitted a whole bunch of extra instructions to unroll a loop - but in the most typical case the loop would run one iteration, and never more than 16. So the additional instructions to unroll the loop by a factor of 4 were completely wasted for the one-iteration case, and fairly useless even for the longest possible loop. The natural "rolled" loop would be only about 3-4 instructions, so the saving was quite small even if the loop had been a lot bigger - but of course, had it been a million iterations, unrolling would probably have tripled the speed of that function.

Upvotes: 2
