Why does the compiler not always optimize away local variables?

Question

I am trying to understand if the removal of local intermediate variables could lead to better optimized code. Consider the following MWE, paying particular attention to the two functions f and g:

struct A {
    double d;
};

struct B {
    double s;
};

struct C {
    A a;
    B b;
};

A geta();
B getb();

C f() {
    const A a = geta();
    const B b = getb();

    C c;
    c.a = a;
    c.b = b;
    return c;
}

C g() {
    C c;
    c.a = geta();
    c.b = getb();
    return c;
}

Both f and g call geta() and getb() to populate an instance of class C which is then returned, but f uses two local intermediate variables to store the returned values of geta() and getb(), while g directly assigns the returned values to the members of c.

Compiled with GCC 9.2 at -O3, the two functions f and g produce exactly the same machine code. However, adding another member variable to either A or B leads to different code: f ends up with a few more instructions. The same holds for clang 8.0.0 with the -O3 flag.

What is happening here? Why is the compiler not able to optimize away the local intermediate variables of f when A or B get a little more complex? Isn't the code of f and g equivalent?

In addition, the behavior is not the same for MSVC v19.22 with the /O2 flag: the Microsoft compiler already generates different code in the first case, i.e. with both classes A and B consisting of a single double.

I am using Godbolt: you can find here the code which produces different binaries.

Peter Cordes · Accepted Answer

This is a missed optimization

Neither function takes the address of C c so escape analysis should easily prove it's a pure local that nothing else could have a pointer to. geta() and getb() can't be reading or writing that variable directly, therefore it's safe to store the geta() return value directly into c.a instead of a temporary on the stack.
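To see why the "address never escapes" condition matters, here is a minimal sketch of the opposite case. The global pointer observer, the function bodies, and the names f_leaky / g_leaky are all made up for illustration; they are not from the question. Once the address of c is published, getb() can observe whether c.a has already been written, so the two versions are no longer interchangeable and the compiler would have to keep the temporary:

```cpp
struct A { double d; };
struct B { double s; };
struct C { A a; B b; };

// Hypothetical global through which the address of the local can escape.
C* observer = nullptr;

// Hypothetical bodies: getb() peeks at c.a if it can reach c.
A geta() { return A{1.0}; }
B getb() { return B{ observer ? observer->a.d : -1.0 }; }

C f_leaky() {
    C c{};               // zero-initialized
    observer = &c;       // the address escapes here
    const A a = geta();
    const B b = getb();  // c.a is not written yet, so getb() reads 0.0
    c.a = a;
    c.b = b;
    observer = nullptr;
    return c;
}

C g_leaky() {
    C c{};
    observer = &c;       // the address escapes here
    c.a = geta();        // c.a is written before getb() runs
    c.b = getb();        // getb() reads 1.0
    observer = nullptr;
    return c;
}
```

With these bodies, f_leaky().b.s is 0.0 while g_leaky().b.s is 1.0. In the original f and g nothing like observer exists, so no callee can tell the difference, which is what makes storing straight into c.a legal there.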

Surprisingly, GCC, clang, ICC, and MSVC all miss this optimization, at least for x86-64; I mostly didn't check other ISAs or older compiler versions. Most of them use call-preserved registers to hold the geta() return value until after getb(): https://godbolt.org/z/WQ9MAF

Fun fact: clang 3.5 has this missed optimization even for g(), defeating the source code's attempt to be efficient.

Fun fact #2: With GCC9.2, compiling as C instead of C++ makes GCC do a much worse job, deoptimizing g(). (I had to change to typedef struct Atag {...} A; but compiling that as C++ still optimizes g(). https://godbolt.org/z/_Y95nj)

clang 8.0 produces an efficient g() with or without -xc, and ICC produces an inefficient g() either way.

ICC's f() is even worse than its g().

MSVC's g() is about as efficient as you could hope for; the Windows x64 calling convention returns the struct by hidden pointer and MSVC never optimizes that to passing a pointer to its own return-value object. (Which it probably couldn't prove is safe anyway, if its own caller was also potentially doing such optimizations.)

Obviously if geta() and getb() can inline, that removes any doubt for the optimizer and it should do the optimization more easily / reliably.
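A sketch of that situation, assuming geta() and getb() are defined in the same translation unit. The bodies and the second member e are invented for illustration (the question only declares the two functions and doesn't name the extra member):

```cpp
struct A { double d; double e; };  // 'e' is a hypothetical second member
struct B { double s; };
struct C { A a; B b; };

// Hypothetical definitions, visible to the optimizer so they can inline.
inline A geta() { return A{1.0, 2.0}; }
inline B getb() { return B{3.0}; }

C f() {
    const A a = geta();  // local intermediates, as in the question
    const B b = getb();

    C c;
    c.a = a;
    c.b = b;
    return c;
}

C g() {
    C c;
    c.a = geta();        // direct assignment, as in the question
    c.b = getb();
    return c;
}
```

With the bodies visible there is no opaque call between the two stores, so the compiler has a much easier time treating f and g the same. I haven't checked the generated asm for this exact variant, so compare the output of your own compiler before relying on it.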
