Anycorn

Reputation: 51435

Can the compiler optimize out the initialization of a static local variable?

What should the behavior be in the following case:

class C {
    boost::mutex mutex_;
    std::map<...> data_;
};

C& get() {
    static C c;
    return c;
}

int main() {
    get(); // is compiler free to optimize out the call? 
    ....
}

Is the compiler allowed to optimize out the call to get()?

The idea was to touch the static variable to initialize it before multithreaded operations need it.

Is this a better option?

C& get() {
    static C *c = new C();
    return *c;
}

Upvotes: 5

Views: 2282

Answers (4)

Peter Cordes

Reputation: 363980

Your original code is safe. Don't introduce an extra level of indirection (a pointer variable that has to get loaded before the address of the std::map is available.)

As Jerry Coffin says, your code has to run as if it ran in source order. That includes running as-if it has constructed your boost::mutex (or std::mutex) and std::map before later stuff in main, such as starting threads.

Pre C++11, the language standard and memory model wasn't officially thread-aware, but stuff like this (thread-safe static-local initialization) worked anyway because compiler writers wanted their compilers to be useful. e.g. GCC 4.1 from 2006 (https://godbolt.org/z/P3sjo4Tjd) still uses a guard variable to make sure a single thread does the constructing in case multiple calls to get() happen at the same time.

Now, with C++11 and later, the ISO standard does include threads and it's officially required for that to be safe.


Since your program can't observe the difference, it's hypothetically possible that a compiler could choose to skip construction now and let it happen in the first thread to actually call get() in a way that isn't optimized away. That's fine; construction of static locals is thread-safe, with compilers like GCC and Clang using a "guard variable" that they check (read-only with an acquire load) at the start of the function.

A file-scope static variable would avoid the load+test/branch fast-path overhead of the guard variable that happens every call, and would be safe as long as nothing calls get() before the start of main(). A guard variable is pretty cheap especially on ISAs like x86, AArch64, and 32-bit ARMv8 that have cheap acquire loads, but more expensive on ARMv7 for example where an acquire load uses a dmb ish full barrier.

If some hypothetical compiler actually did the optimization you're worried about, the difference could be in NUMA placement of the page of .bss holding static C c, if nothing else in that page was touched first. And potentially stalling other threads very briefly in their first calls to get() if construction isn't finished by the time a second thread also calls get().


Current GCC and Clang don't do this optimization in practice

Clang 17 with libc++ makes the following asm for x86-64, with -O3. (demangled by Godbolt). The asm for get() is also inlined into main. GCC with libstdc++ is pretty similar, really only differing in the std::map internals.

get():
        movzx   eax, byte ptr [rip + guard variable for get()::c]  # all x86 loads are acquire loads
        test    al, al                       # check the guard variable
        je      .LBB0_1
        lea     rax, [rip + get()::c]        # retval = address of the static variable
   # end of the fast path through the function.
   # after the first call, all callers go through this path.
        ret

 # slow path, only reached if the guard variable is zero
.LBB0_1:
        push    rax
        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_acquire@PLT
        test    eax, eax   # check if we won the race to construct c,
        je      .LBB0_3    # or if we waited until another thread finished doing it.

        xorps   xmm0, xmm0
        movups  xmmword ptr [rip + get()::c+16], xmm0     # first 16 bytes of std::map<int,int> = NULL pointers
        movups  xmmword ptr [rip + get()::c], xmm0        # std::mutex = 16 bytes of zeros
        mov     qword ptr [rip + get()::c+32], 0          # another NULL
        lea     rsi, [rip + get()::c]                     # arg for __cxa_atexit
        movups  xmmword ptr [rip + get()::c+48], xmm0     # more zeros, maybe a root node?
        lea     rax, [rip + get()::c+48]                  
        mov     qword ptr [rip + get()::c+40], rax        # pointer to another part of the map object

        lea     rdi, [rip + C::~C() [base object destructor]]  # more args for atexit
        lea     rdx, [rip + __dso_handle]
        call    __cxa_atexit@PLT                 # register the destructor function-pointer with a "this" pointer

        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_release@PLT          # "unlock" the guard variable, setting it to 1 for future calls
             # and letting any other threads return from __cxa_guard_acquire and see a fully-constructed object

.LBB0_3:                                     # epilogue
        add     rsp, 8
        lea     rax, [rip + get()::c]        # return value, same as in the fast path.
        ret

Even though the std::map is unused, constructing it involves calling __cxa_atexit (a C++-internals version of atexit) to register the destructor to free the red-black tree as the program exits. I suspect this is the part that's opaque to the optimizer and the main reason it doesn't get optimized like static int x = 123; or static void *foo = &bar; into pre-initialized space in .data with no run-time construction (and no guard variable).

Constant-propagation to avoid the need for any run-time initialization is what happens if struct C only includes std::mutex, which in GNU/Linux at least doesn't have a destructor and is actually zero-initialized. (C++ before C++23 allowed early init even when that included visible side-effects. This doesn't; compilers can still constant-propagate static int local_foo = an_inline_function(123); into some bytes in .data with no run-time call.)

GCC and Clang also don't optimize away the guard variable (if there's any run-time work to do), even though main doesn't start any threads at all, let alone before calling get(). A constructor in some other compilation unit (including a shared library) could have started another thread that called get() at the same time main did. (It's arguably a missed optimization with gcc -fwhole-program.)


If the constructors had any (potentially) visible side-effects, perhaps including a call to new since new is replaceable, compilers couldn't defer it because the C++ language rules say when the constructor is called in the abstract machine. (Compilers are allowed to make some assumptions about new, though, e.g. clang with libc++ can optimize away new / delete for an unused std::vector.)

Classes like std::unordered_map (a hash table instead of a red-black tree) do use new in their constructor.

I was testing with std::map<int,int>, so the individual objects don't have destructors with visible side-effects. A std::map<Foo,Bar> where Foo::~Foo prints something would make it matter when the static-local initializer runs, since that's when we call __cxa_atexit. Assuming destruction order happens in reverse of construction, waiting until later to call __cxa_atexit could lead to it being destructed sooner, leading to Foo::~Foo() calls happening too soon, potentially before instead of after some other visible side effect.

Or some other global data structure could maybe have references to the int objects inside a std::map<int,int>, and use those in its destructor. That wouldn't be safe if we destruct the std::map too soon.

(I'm not sure if ISO C++, or GNU C++, gives such ordering guarantees for sequencing of destructors. But if it does, that would be a reason compilers couldn't normally defer construction when it involves registering a destructor. And looking for that optimization in trivial programs isn't worth the cost in compile time.)


With file-scope static to avoid a guard variable

Notice the lack of a guard variable in the asm below (https://godbolt.org/z/4bGx3Tasj), making the fast path faster, especially on ISAs like ARMv7 that don't have a cheap way to do just an acquire barrier.

static C global_c;     // It's not actually global, just file-scoped static

C& get2() {
    return global_c;
}
# clang -O3 for x86-64
get2():
      # note the lack of a load + branch on a guard variable
        lea     rax, [rip + global_c]
        ret

main:
      # construction already happened before main started, and we don't do anything with the address
        xor     eax, eax
        ret
# GCC -O3 -mcpu=cortex-a15     // a random ARMv7 CPU
get2():
        ldr     r0, .L81          @ PC-relative load
        bx      lr

@ somewhere nearby, between functions
.L81:
        .word   .LANCHOR0+52      @ pointer to struct C global_c

main:
        mov     r0, #0
        bx      lr

The constructor code that does the stores and calls __cxa_atexit still exists, it's just in a separate function called _GLOBAL__sub_I_example.cpp: (clang) or _GLOBAL__sub_I_get(): (GCC), which the compiler adds to a list of init functions to be called before main.

Function-scoped local vars are normally fine, the overhead is pretty minimal, especially on x86-64 and ARMv8. But since you were worried about micro-optimizations like when std::map was constructed at all, I thought it was worth mentioning. And to show the mechanism compilers use to make this stuff work under the hood.

Upvotes: 1

Jerry Coffin

Reputation: 490018

Updated (2023) Answer:

In C++23 (N4950) any side effects of initializing a static local variable are observable as its containing block is entered. As such, unless the compiler can determine that initializing the variable has no visible side effects, it will have to generate code to call get() at the appropriate time (or to execute an inlined version of get(), as the case may be).

Contrary to earlier standards, C++23 no longer gives permission for dynamic initialization of a static local variable to be done "early" (as discussed below).

[stmt.dcl]/3:

Dynamic initialization of a block variable with static storage duration (6.7.5.2) or thread storage duration (6.7.5.3) is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

Original (2010) answer:

The C and C++ standards operate under a rather simple principle generally known as the "as-if rule" -- basically, that the compiler is free to do almost anything as long as no conforming code can discern the difference between what it did and what was officially required.

I don't see a way for conforming code to discern whether get was actually called in this case, so it looks to me like it's free to optimize it out.

At least as recently as N4296, the standard contained explicit permission to do early initialization of static local variables:

Constant initialization (3.6.2) of a block-scope entity with static storage duration, if applicable, is performed before its block is first entered. An implementation is permitted to perform early initialization of other block-scope variables with static or thread storage duration under the same conditions that an implementation is permitted to statically initialize a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

So, under this rule, initialization of the local variable could happen arbitrarily early in execution, so even if it has visible side effects, they're allowed to happen before any code that attempts to observe them. As such, you aren't guaranteed to see them, so optimizing it out is allowed.

Upvotes: 4

Chubsdad

Reputation: 25487

Whether the compiler optimizes out the function call is essentially unspecified behavior as far as the Standard is concerned: a behavior chosen from a finite set of possibilities, where the choice need not be made consistently. Here the choice is 'optimize' or 'don't', which the Standard does not specify and the implementation is not required to document, since a given implementation may not make it the same way every time.

If the idea is just to 'touch' the variable, would it help to add a dummy volatile variable and increment it on each call?

e.g.

C& getC() {
   static volatile int dummy = 0;  // volatile access can't be optimized away
   dummy++;
   // rest of the code
}

Upvotes: 0

SingleNegationElimination

Reputation: 156138

Based on your edits, here's an improved version, with the same results.

Input:

struct C { 
    int myfrob;
    int frob();
    C(int f);
 };
C::C(int f) : myfrob(f) {}
int C::frob() { return myfrob; }

C& get() {
    static C *c = new C(5);
    return *c;
}

int main() {
    return get().frob(); // is compiler free to optimize out the call? 

}

Output:

; ModuleID = '/tmp/webcompile/_28088_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"

%struct.C = type { i32 }

@guard variable for get()::c = internal global i64 0            ; <i64*> [#uses=4]

declare i32 @__cxa_guard_acquire(i64*) nounwind

declare i8* @operator new(unsigned long)(i64)

declare void @__cxa_guard_release(i64*) nounwind

declare i8* @llvm.eh.exception() nounwind readonly

declare i32 @llvm.eh.selector(i8*, i8*, ...) nounwind

declare void @__cxa_guard_abort(i64*) nounwind

declare i32 @__gxx_personality_v0(...)

declare void @_Unwind_Resume_or_Rethrow(i8*)

define i32 @main() {
entry:
  %0 = load i8* bitcast (i64* @guard variable for get()::c to i8*), align 8 ; <i8> [#uses=1]
  %1 = icmp eq i8 %0, 0                           ; <i1> [#uses=1]
  br i1 %1, label %bb.i, label %_Z3getv.exit

bb.i:                                             ; preds = %entry
  %2 = tail call i32 @__cxa_guard_acquire(i64* @guard variable for get()::c) nounwind ; <i32> [#uses=1]
  %3 = icmp eq i32 %2, 0                          ; <i1> [#uses=1]
  br i1 %3, label %_Z3getv.exit, label %bb1.i

bb1.i:                                            ; preds = %bb.i
  %4 = invoke i8* @operator new(unsigned long)(i64 4)
          to label %invcont.i unwind label %lpad.i ; <i8*> [#uses=2]

invcont.i:                                        ; preds = %bb1.i
  %5 = bitcast i8* %4 to %struct.C*               ; <%struct.C*> [#uses=1]
  %6 = bitcast i8* %4 to i32*                     ; <i32*> [#uses=1]
  store i32 5, i32* %6, align 4
  tail call void @__cxa_guard_release(i64* @guard variable for get()::c) nounwind
  br label %_Z3getv.exit

lpad.i:                                           ; preds = %bb1.i
  %eh_ptr.i = tail call i8* @llvm.eh.exception()  ; <i8*> [#uses=2]
  %eh_select12.i = tail call i32 (i8*, i8*, ...)* @llvm.eh.selector(i8* %eh_ptr.i, i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*), i8* null) ; <i32> [#uses=0]
  tail call void @__cxa_guard_abort(i64* @guard variable for get()::c) nounwind
  tail call void @_Unwind_Resume_or_Rethrow(i8* %eh_ptr.i)
  unreachable

_Z3getv.exit:                                     ; preds = %invcont.i, %bb.i, %entry
  %_ZZ3getvE1c.0 = phi %struct.C* [ null, %bb.i ], [ %5, %invcont.i ], [ null, %entry ] ; <%struct.C*> [#uses=1]
  %7 = getelementptr inbounds %struct.C* %_ZZ3getvE1c.0, i64 0, i32 0 ; <i32*> [#uses=1]
  %8 = load i32* %7, align 4                      ; <i32> [#uses=1]
  ret i32 %8
}

Noteworthy: no code is emitted for ::get, but main still allocates ::get::c (at %4), with a guard variable as needed (at %2 and at the end of invcont.i and lpad.i). LLVM is inlining all of that here.

tl;dr: Don't worry about it, the optimizer normally gets this stuff right. Are you seeing an error?

Upvotes: 3
