Reputation: 35
(IMO) Cons when not inlined:
1. call is an unconditional branch (jump) with stack manipulation, and the x86 instruction tables specify that it needs 4 uops per instruction and has a reciprocal throughput of 2 on the Zen 4 microarchitecture. Every function call requires argument passing, and the function itself requires a fixed prologue and epilogue for its own stack and registers, whereas an inlined function may avoid the argument passing as well as the prologue/epilogue (see the sketch after this list).
2. A non-static version of a function cannot be evaluated at compile time when it is used by another "module" (executable or library).
3. If the compiler decides not to inline a static inline function, it has to generate separate object code for it and link to it statically. Each separate copy requires its own physical memory when loaded, unlike extern functions in a shared library, where all processes use a single read-only shared memory region.
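A quick way to eyeball the per-call overhead from point 1 is to compare the code generated for an out-of-line call against a forced-inline call. This is only a minimal sketch (the function names and the file name are made up, and the exact output depends on the compiler and options):

/* overhead.c -- compile with e.g. cc -O2 -S overhead.c and compare the
 * code generated for the two callers below. */
static __attribute__((noinline)) int add_outlined(int a, int b) { return a + b; }
static inline __attribute__((always_inline)) int add_inlined(int a, int b) { return a + b; }

int use_outlined(int x) { return 2 * add_outlined(x, 1); }  /* keeps an actual call in the caller */
int use_inlined(int x)  { return 2 * add_inlined(x, 1); }   /* the add is folded into the caller */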
When the code inside a function is, once compiled, "smaller" (both in code size and in overhead) than the function-calling cost listed in point 1, it is generally good to force inlining (except when that code needs to be linked dynamically all the time), like:
static __always_inline __attribute((sysv_abi))
int evaluate_this_program_quality() {
/*
* `mov eax, 0` has 5 bytes, and `call evaluate_this_program_quality@PLT`
* is also 5 bytes.
*/
return 0;
}
(IMO) Cons when inlined:
On modern processors, because main memory such as DDR5 is much more deeply pipelined than older memory, and because the internal timing constraints of the processors get significantly relieved as fabrication improves, cache-miss latency makes up a significantly larger fraction of total execution time. For example, there are many recent reports that compiling with the -Os option gives better time-wise performance than -O2, while going higher than that (e.g. -O3) is always worse due to the increased code footprint. Also, nowadays there is LTO (link-time optimization), which can inline code (among other optimizations) across separately compiled objects. So, should I not use inline hints at all?
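To make the LTO point concrete, here is a minimal sketch of relying on -flto instead of any inline hint (the file names, lto_gettid, and the build commands are illustrative only; it assumes GCC or Clang and glibc 2.30+ for gettid()):

/* tid_lto.c -- note: no inline hint at all */
#define _GNU_SOURCE          /* gettid() needs glibc >= 2.30 */
#include <unistd.h>

pid_t lto_gettid(void) { return gettid(); }

/* main.c */
#include <stdio.h>
#include <sys/types.h>

pid_t lto_gettid(void);      /* would normally live in a header */

int main(void) {
    printf("%d\n", (int)lto_gettid());
    return 0;
}

/*
 * Assumed build (GCC or Clang):
 *   cc -O2 -flto -c tid_lto.c main.c
 *   cc -O2 -flto tid_lto.o main.o -o demo
 * With -flto the optimizer sees both translation units at link time and can
 * inline lto_gettid() into main() even without any inline keyword.
 */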
For example, an out-of-line version where fast_gettid() stays inside the library:
/* Inside library's source code (*.c) */
static thread_local pid_t _tid __attribute__((tls_model("initial-exec")));
#define likely(expr) __builtin_expect(!!(expr), 1)
pid_t fast_gettid() { return likely(_tid) ? _tid : (_tid = gettid()); }
/* Inside library's public header (*.h) */
pid_t fast_gettid();
versus a version where the TLS variable is exported and fast_gettid() is always inlined into callers:
/* Inside library's source code (*.c) */
thread_local pid_t _tid __attribute__((tls_model("initial-exec")));
/* Inside library's public header (*.h) */
extern thread_local pid_t _tid;
#define likely(expr) __builtin_expect(!!(expr), 1)
static __always_inline pid_t fast_gettid() { return likely(_tid) ? _tid : (_tid = gettid()); }
How "small" code is actually good to be used as static __always_inline
function? Is fast_gettid()
"small" enough, or not?
Another example:
static __always_inline __attribute((pure)) const char *
filename(const char *restrict path) {
if (path) {
const char *const restrict __basename = __builtin_strrchr(path, '/');
return likely(__basename) ? __basename + 1 : path;
}
return path;
}
Is this function "small" enough for modern processor? Should I separate constant-exprssion version? Should I maintain this as __always_inline
, or change to just inline
and let the compiler to decide, or remove inline hint and let LTO optimize it?
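To be explicit about the constant-expression version mentioned above, this is the kind of companion macro I have in mind (a sketch only; FILENAME_LITERAL is a made-up name, and it relies on GCC/Clang constant-folding __builtin_strrchr when the argument is a string literal such as __FILE__):

/* Only valid for string literals: the "" forces a compile error otherwise.
 * With a literal argument, GCC/Clang fold the strrchr at compile time, so
 * no call to filename() is emitted at all. */
#define FILENAME_LITERAL(lit) \
    (__builtin_strrchr("" lit, '/') ? __builtin_strrchr("" lit, '/') + 1 : (lit))

With something like that, __FILE__-style literals would be handled purely at compile time while filename() stays around for runtime strings.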
For a while now I have been searching for a deterministic (programmatic) way to decide these issues, or at least for well-known, best-effort principles about them. I know the answer will depend on circumstances, but I just cannot settle without knowing a baseline opinion.
Upvotes: 1
Views: 103