Deterministic (programmatic) guideline for function inlining when targeting modern x86 processors?

Question

(IMO) Cons when not inlined:

1. call is an unconditional branch (jump) with stack manipulation; the x86 instruction table lists it as 4 uops per instruction with a reciprocal throughput of 2 on the Zen 4 microarchitecture. Every call also requires argument passing, and the function itself needs a fixed prologue and epilogue for its own stack and registers, whereas an inlined function may reduce the argument passing as well as the prologue/epilogue.
2. A non-static function cannot be evaluated at compile time when it is used by another "module" (executable or library).
3. If the compiler decides not to inline a static inline function, it has to generate separate object code and link it statically. Each separate copy requires its own physical memory when loaded, unlike extern functions, which all processes use from a single read-only shared memory region.
4. When the compiled code inside a function is "smaller" (both in code size and in overhead) than the function-call cost listed in 1., it is generally good to force inlining (unless that code must always be linked dynamically), for example:
```
static __always_inline __attribute((sysv_abi))
int evaluate_this_program_quality() {
    /*
     * `mov eax, 0` has 5 bytes, and `call evaluate_this_program_quality@PLT`
     * is also 5 bytes.
     */
    return 0; 
}
```
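
The size argument in the last point can be checked by compiling the same body under both policies and reading the generated assembly. A minimal sketch, assuming GCC or Clang attribute syntax; all function names here are hypothetical and not part of the original code:

```c
#include <assert.h>

/* Hypothetical names; the two bodies are identical on purpose. */
__attribute__((noinline))
int quality_outlined(void) { return 0; }

static inline __attribute__((always_inline))
int quality_inlined(void) { return 0; }

/* Callers to inspect in the generated assembly. */
int caller_outlined(void) { return quality_outlined() + 1; }
int caller_inlined(void)  { return quality_inlined() + 1; }

/* Usage: both policies must agree on the result; only the code differs. */
void check_callers(void)
{
    assert(caller_outlined() == 1);
    assert(caller_inlined() == 1);
}
```

Compiling with gcc -O2 -S should show caller_inlined reduced to loading a constant. Whether caller_outlined keeps a real call depends on the compiler and version, since interprocedural optimizations can still see through a noinline function.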

(IMO) Cons when inlined:

It increases code size and the cost of instruction cache misses. The inlined copies cannot share a single cached copy of the code, and if the target processor uses a BTB (branch target buffer) or a similar structure for branch prediction, the branches of every copy sit at different addresses, which hurts branch prediction.

On modern processors, main memory such as DDR5 is more deeply pipelined than older memory, and the processors' internal timing constraints have been relieved significantly as fabrication improves, so cache-miss latency accounts for a significantly larger share of total execution time. For example, there are many recent reports that compiling with -Os gives better run-time performance than -O2, while anything higher than -O3 is always worse because of the increased code footprint. Also, nowadays there is LTO (link-time optimization), which can inline code (among other things) across separately compiled objects. So, should I not use inline hints at all?

For example:

/* Inside library's source code (*.c) */
static thread_local pid_t _tid __attribute__((tls_model("initial-exec")));
#define likely(expr) __builtin_expect(!!(expr), 1)
pid_t fast_gettid() { return likely(_tid) ? _tid : (_tid = gettid()); }
/* Inside library's public header (*.h) */
pid_t fast_gettid();

/* Inside library's source code (*.c) */
thread_local pid_t _tid __attribute__((tls_model("initial-exec")));
/* Inside library's public header (*.h) */
extern thread_local pid_t _tid;
#define likely(expr) __builtin_expect(!!(expr), 1)
static __always_inline pid_t fast_gettid() { return likely(_tid) ? _tid : (_tid = gettid()); }

How "small" does code have to be to justify a static __always_inline function? Is fast_gettid() "small" enough, or not?
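
One possible middle ground between the two variants is to inline only the cached fast path and keep the system call out of line. A minimal sketch, assuming Linux with GCC or Clang; fast_gettid_split, fetch_tid_slow and cached_tid are hypothetical names, syscall(SYS_gettid) stands in for gettid() so the sketch builds without _GNU_SOURCE, and the variable is static only to keep the example in one file:

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define likely(expr) __builtin_expect(!!(expr), 1)

/* Cached thread ID for the current thread; 0 means "not fetched yet". */
static _Thread_local pid_t cached_tid;

/* Slow path: kept out of line and marked cold so that call sites only
 * carry a call to it. syscall(SYS_gettid) stands in for gettid(). */
static __attribute__((noinline, cold)) pid_t fetch_tid_slow(void)
{
    cached_tid = (pid_t)syscall(SYS_gettid);
    return cached_tid;
}

/* Fast path: one TLS load and one branch hinted as likely. */
static inline __attribute__((always_inline)) pid_t fast_gettid_split(void)
{
    pid_t tid = cached_tid;
    return likely(tid) ? tid : fetch_tid_slow();
}

/* Usage: the first call in a thread goes through the slow path,
 * later calls are served from the cache. */
void check_fast_gettid(void)
{
    pid_t first = fast_gettid_split();
    assert(first > 0);
    assert(fast_gettid_split() == first);
}
```

With this split, each call site grows only by the load, the compare and the rarely taken call, which is the part of the code-size question that can be reasoned about without measuring.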

Another example:

static __always_inline __attribute((pure)) const char *
filename(const char *restrict path) {
  if (path) {
    const char *const restrict __basename = __builtin_strrchr(path, '/');
    return likely(__basename) ? __basename + 1 : path;
  }
  return path;
}

Is this function "small" enough for a modern processor? Should I provide a separate constant-expression version? Should I keep it as __always_inline, change it to plain inline and let the compiler decide, or remove the inline hint and let LTO optimize it?
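
The "separate constant-expression version" idea could be sketched as a dispatch on __builtin_constant_p. This is only an illustration under the assumption of GCC or Clang; filename_of, filename_outlined and FILENAME are hypothetical names, and whether a string literal counts as constant here, or whether strrchr is actually folded at compile time, depends on the compiler and optimization level:

```c
#include <assert.h>
#include <string.h>

/* Inline version: with a constant argument the compiler may be able to
 * fold the strrchr call away entirely. */
static inline __attribute__((always_inline, pure)) const char *
filename_of(const char *path)
{
    if (!path)
        return path;
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Out-of-line version: one shared copy for run-time arguments. */
__attribute__((noinline)) const char *filename_outlined(const char *path)
{
    return filename_of(path);
}

/* Hypothetical dispatch: inline only when the argument is known at
 * compile time, otherwise call the shared copy. Both branches return
 * the same result, so correctness does not depend on the choice. */
#define FILENAME(p) \
    (__builtin_constant_p(p) ? filename_of(p) : filename_outlined(p))

/* Usage: the result is the same whichever branch is taken. */
void check_filename(void)
{
    assert(strcmp(FILENAME("/usr/lib/libc.so"), "libc.so") == 0);
    assert(strcmp(FILENAME("plain"), "plain") == 0);
    assert(filename_outlined(NULL) == NULL);
}
```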

I have been searching for a while for a deterministic (programmatic) way to decide these issues, or at least for well-known, best-effort principles about them. I know the answer will depend on circumstances, but I would really like to know the baseline opinion.

Answers (0)
