Reputation: 586
I'm trying to build an in-memory database with transactions. I'm hitting bottlenecks at the compiler level, specifically the inlining limit. I don't have much knowledge about compilers and I would like to know why some things are done the way they are.
The absolute priority of my in-memory database is time performance. It must be super fast, that is the goal. I want to have everything in shared memory, so every database access is a direct memory access. To handle race conditions and transactions, spinlocks are implemented at the memory level.
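To give an idea of what I mean by spinlocks at the memory level, here is a minimal C++ sketch; the name SpinLock and the details are illustrative only, not my actual implementation:

#include <atomic>

// Minimal illustration of a lock that lives directly in (shared) memory.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock() {
        // Busy-wait until the flag is acquired.
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin
        }
    }

    void unlock() {
        flag.clear(std::memory_order_release);
    }
};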
When I have this pseudo code:
var garage = DB.GetGarage(123);
var car = DB.CreateCar();
car.Color = 2;
garage.ConcurrentBag.Add(car);
For the implementation of all these (automatically generated) methods, GetGarage, CreateCar and ConcurrentBag.Add, I have inlining enabled. Why? I found out that it is faster.
But the reason why it is faster is probably not the overhead of calling the functions. It seems that when they are inlined, the compiler can produce better machine code than when they are not. In other words, these methods have things in common which the compiler can probably simplify, but only if they are inlined; otherwise it cannot simplify them.
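To illustrate the kind of simplification I mean, here is a made-up C++ sketch (hypothetical accessors, not my actual generated code). Each accessor repeats the same bounds check and address computation, and the compiler can only merge that work once both calls are inlined into the caller:

struct Db { int* base; int size; };

// Hypothetical generated accessors; each one repeats the same validation.
inline int get_color(const Db& db, int car) {
    if (car < 0 || car >= db.size) return -1;   // bounds check
    return db.base[car * 8 + 1];
}

inline int get_owner(const Db& db, int car) {
    if (car < 0 || car >= db.size) return -1;   // same bounds check again
    return db.base[car * 8 + 2];
}

int use(const Db& db, int car) {
    // Once both calls are inlined here, the compiler sees that the two bounds
    // checks and the address computation car * 8 are redundant and keeps only
    // one of each; across real (non-inlined) calls it cannot do that.
    return get_color(db, car) + get_owner(db, car);
}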
Now I'm coming to the crux of the problem. I mark everything as inline and I expect the compiler to inline everything.
Well, that is not how it works. The compiler may choose not to inline, and let's be specific here:
.NET C# - AggressiveInlining - The attribute might cause implementation limits to be encountered that will result in slower generated code.
C99, C++ - inline - C++ and C99, but not its predecessors K&R C and C89, have support for inline functions, though with different semantics. In both cases, inline does not force inlining; the compiler is free to choose not to inline the function at all, or only in some cases.
I have tried it with C# and reached the limit. Past that limit, nothing was inlined any more. I think something similar would happen with GCC:
max-inline-insns-single: Several parameters control the tree inliner used in gcc. This number sets the maximum number of instructions (counted in GCC's internal representation) in a single function that the tree inliner will consider for inlining. This only affects functions declared inline and methods implemented in a class declaration (C++). The default value is 500.
At least with GCC I can choose the limit; with .NET I cannot.
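For example, GCC lets me raise those budgets with parameters like the one quoted above; the numbers below are arbitrary values just to show the syntax, not tuned recommendations:

g++ -O3 --param max-inline-insns-single=2000 -finline-limit=2000 main.cpp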
I would really like to get some feedback on this question, please. I know that not everything can be inlined (for example recursive calls). Let's skip the cases where it is not possible at all.
I also know that inlining is not guaranteed to give better performance. But I think that all the issues mentioned in the link could be negated by compiler-generated functions.
I also know that when inlining, several factors are considered:
We will measure the quality of inlining along three axes: the time spent generating code (aka throughput -- TP), the time spent executing the code (aka code quality -- CQ), and the size of the generated code (CS).
I think that the reason why compilers don't do this could be the time spent generating the code. But what if I don't care... okay I don't want to wait a year, but I can wait a day if I get 20% faster code.
What do you think about it? Is there any compiler for any programming language which can do this (via some flags, or something like that)?
EDIT: According to @RaymondChen (see comments), what I'm asking for is similar to 'inline everything and then have another step to de-inline things':
De-inlining (also known as "common subexpression elimination") is something compilers already do.
But according to my research, CSE doesn't involve generating a new function, but rather reusing saved data:
Common-subexpression elimination is a transformation that removes the recomputations of common subexpressions and replaces them with uses of saved data.
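To make the distinction concrete, here is a made-up C++ example: CSE keeps a repeated computation in a temporary inside the same function; it does not carve the repeated code out into a new function:

// Before CSE: the subexpression a * b is computed twice.
int f(int a, int b, int c, int d) {
    int x = a * b + c;
    int y = a * b + d;
    return x + y;
}

// After CSE (conceptually): the compiler keeps the result in a temporary
// and reuses it; no new function is generated.
int f_cse(int a, int b, int c, int d) {
    int t = a * b;              // computed once, reused below
    return (t + c) + (t + d);
}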
I cannot find anything about compiler generated functions other than some related to C++ class constructors, destructors and operators. So, I'm still looking for an answer and hope somebody can provide some sources.
@RaymondChen also mentions:
finding de-inlining opportunities becomes harder the bigger the code being analyzed. The number of things to check grows (naively) as the fourth power of the code size. The time required for a large program will probably exceed your human lifetime, and the compiler itself will run out of memory long before then.
This could be a good answer to my question, but it is also something I'm struggling to accept. If finding de-inlining opportunities in a large program would take the compiler more than a human lifetime, how is it possible that I, as a human, can do it myself in reasonable time, just by looking at the (high-level, not machine) code and refactoring?
I understand that some tasks (pattern recognition, language translation, etc.) are really hard for computers. But today we have neural networks. Would it be possible to use a neural network for something like finding de-inlining opportunities?
@PeterCordes mentions:
(in real-world compilers which don't try to re-factor straight-line code back into functions or loops)
I'm asking why again. I'm sure the compiler could come up with better functions than me. Why does the compiler just take my functions and, at most, optimize or inline them, but never (apart from C++ constructors, destructors, etc.) generate new ones?
Upvotes: 6
Views: 510
Reputation: 67802
You suggest that the compiler should inline everything, and then de-inline certain functions (possibly ones it synthesized itself).
But what criteria should it use for de-inlining? It doesn't know what your data look like at runtime, or which paths you want to optimize for.
The compiler can tell when your inlined super function exceeds the instruction cache size, but it doesn't have much information about which code can be extracted again without slowing everything down.
The information needed to do this is knowledge of your data and of which paths are hot at runtime.
This knowledge can come from your deep understanding of the problem domain, and you can communicate it by manually splitting code into inline and non-inline functions, and by annotating with [[likely]], [[unlikely]], [[gnu::noinline]], [[gnu::flatten]], [[gnu::always_inline]], [[gnu::hot]], [[gnu::cold]], etc.
Or it can come from running a profiled build with actual data and then using profile-guided optimization.
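As a minimal sketch of what those annotations can look like with GCC/Clang in C++ (the function names and bodies here are invented purely for illustration):

#include <cstdio>

// Hot path: encourage the compiler to inline everything reachable from here.
[[gnu::hot]] [[gnu::flatten]]
static void commit_fast_path() {
    std::puts("fast commit");
}

// Cold path: keep it out of line so it does not bloat the hot code.
[[gnu::cold]] [[gnu::noinline]]
static void handle_corrupted_page() {
    std::puts("recovering...");
}

void commit(bool page_ok) {
    if (page_ok) [[likely]] {
        commit_fast_path();
    } else [[unlikely]] {
        handle_corrupted_page();
    }
}

int main() {
    commit(true);   // hot path
    commit(false);  // cold path
}

With GCC, profile-guided optimization typically means compiling with -fprofile-generate, running the program on representative data, and then recompiling with -fprofile-use.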
Just to clarify a technical detail that was picked up in comments:
instructions are data, and all data access latencies are heavily affected by cache behaviour. See this question for lots of detail, but briefly, inlining a function in, say, two places means you have 2 copies of the code.
In a non-trivial program, we'd expect at least some functions to be better not inlined. There's no way for the compiler to know for certain which functions though.
Upvotes: 3
Reputation: 586
I'm going to summarize an answer, although I won't mark it as accepted because I'm still looking for a better one.
Compilers don't inline everything because of L1/L2/L3 cache sizes, which are still small these days. My CPU (Core i7-12700K) has roughly 960 kB of L1 cache in total, but that is the sum over all cores and includes both data and instruction caches; each core only has a few tens of kB of L1i, which is not much for a large inlined program. Code that falls out of cache slows the program down and negates the benefits of inlining.
@PeterCordes: Just for the record since it didn't get mentioned explicitly, instruction-cache locality is the main performance reason for limits on inlining ... Modern desktops/servers have lots of RAM, but L1i and L2 caches are still small.
Compilers don't inline everything and then analyze the result, because analyzing a large inlined program and looking for common parts is very time consuming. It remains an open question whether AI could be used for that.
@RaymondChen: The number of things to check grows (naively) as the fourth power of the code size.
Compilers don't generate their own optimized functions, because they rely on the programmer knowing which paths need to be fast and which don't.
@Useless: This knowledge can come from your deep understanding of the problem domain, and you can communicate it by manually splitting code into inline and non-inline functions
I'll allow myself a little reflection at the end. Every program can be written in an infinite number of ways. If you think about it, a function is just a separation of code. Adding a function instead of inlining doesn't change the logic of the code. You can add (de-inline) as many functions as you like; as long as they are equivalent to the inlined code, the logical result doesn't change.
What changes is the performance. Today's compilers rely on programmers to design functions, even though functions are not needed* for application logic**. Functions are needed for performance reasons, and compilers have very limited ability to refactor them (due to compile time) or inline them (due to cache size). But instead of simply adopting the programmer's often poorly designed functions, I think a better solution is needed.
*They are needed to make code readable and maintainable, but this is something we don't care about on machine code level.
**With some exceptions, for example recursion and library invocation.
Upvotes: 1
Reputation: 310
While inlining is a powerful tool for optimizing programs, as it unlocks additional optimizations at the call site, it is still a trade-off because it increases program size.
This increased program size can hurt performance because it reduces locality, and locality matters: memory access is much faster when the data being used is close together and fits in cache.
Some compilers even have flags to optimize for code size rather than speed (for example GCC's -Os versus -O2/-O3).
TL;DR: inlining is not a one-size-fits-all solution.
Upvotes: -1