Reputation: 13
While programming with intrinsics, the following issue came up: when I load or store a local variable in an inlined function, I get a memory-violation error, but only if the function is inlined. I have no idea why the stack variables are not aligned in the inline function.
I have tested this with several GCC versions: 4.9, 5.3, and 6.1.
Example that failed:
#include <immintrin.h>

static inline void foo(double *phi){
    double localvar[4];
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_store_pd (localvar, res); // <- failed due to memory violation
    ...
}
If I add __attribute__ ((aligned (32))) to localvar, or remove inline, then the function works as it should.
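For reference, a minimal sketch of the working variant with the attribute added (the void return type, the name foo_aligned, and the use_result call are mine, only there to make the snippet self-contained):

#include <immintrin.h>

extern void use_result(double *buf);   // illustrative consumer, not part of the real code

static inline void foo_aligned(double *phi){
    double localvar[4] __attribute__ ((aligned (32)));  // buffer is now 32-byte aligned
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_store_pd (localvar, res);   // no fault: the aligned store's requirement is met
    use_result(localvar);
}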
So can someone explain to me (in detail, please) why local variables are generally aligned without adding __attribute__ ((aligned (32))), but local variables in an inline function are not?
Upvotes: 0
Views: 990
Reputation: 364039
Providing 32-byte alignment costs extra instructions (because the ABI only guarantees 16-byte alignment; just look at the asm for the version with alignas(32) or __attribute__((aligned(32)))). Of course the compiler doesn't do it if you don't ask for it, because it's not free. (See also gcc's -mpreferred-stack-boundary, which controls this, and the x86 tag wiki for links to ABI docs.)
double localvar[4]; only needs to be 8-byte aligned for each element to be naturally aligned. The SysV x86-64 ABI does guarantee 16-byte alignment for C99 variable-size arrays, but I'm not sure whether normal compile-time-constant-sized arrays get 16-byte alignment by default or not.
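If you want to see what alignment a plain double array actually gets in your build, a quick diagnostic (not from the original answer, just a sketch) is to print the address modulo 32:

#include <stdio.h>
#include <stdint.h>

void check_alignment(void){
    double localvar[4];
    // 0 means the array happens to start on a 32-byte boundary;
    // 8, 16 or 24 means an aligned 32-byte store to it would fault.
    printf("localvar %% 32 = %u\n", (unsigned)((uintptr_t)localvar % 32));
}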
However, current versions of gcc for some reason align the stack to 32B in a test function that has __m256d local variables. At -O3 it doesn't spill them to the stack, so the alignment instructions are wasted (other than making buggy code like this happen to work). The fact that gcc doesn't remove this stuff is a missed optimization. (It's needed at -O0, where gcc does spill everything to memory.)
Since my version of your test function (which actually compiles) doesn't have any other locals, the array of doubles is also 32B-aligned. Presumably you're inlining it into a caller that has some other locals, and that leads to different alignment for the array.
Here's the code on the Godbolt compiler explorer:
#include <immintrin.h>

extern void use_buffer(double*);

// static inline
void no_alignment(const double *phi){
    double localvar[4];
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_storeu_pd (localvar, res); // use an unaligned store since we didn't request alignment for the buffer
    use_buffer(localvar);
}
lea r10, [rsp+8] // save old RSP (in a clumsy way)
and rsp, -32 // truncate RSP to the next 32B boundary
push QWORD PTR [r10-8] // save more stuff
push rbp
mov rbp, rsp
push r10
sub rsp, 40
... vmovupd YMMWORD PTR [rbp-48], ymm0 ... // function body
add rsp, 40
pop r10
pop rbp
lea rsp, [r10-8]
This is why your code happens to work when it's not inlined. Although it's strange that it doesn't get inlined anyway, even without the inline keyword, unless you compiled without optimization or you didn't use static to let the compiler know that a separate definition wasn't needed.
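As an illustration of that last point, a hypothetical helper-plus-caller sketch (the names and the -O2 remark are mine): with static inline and optimization enabled, gcc inlines the call and emits no standalone definition of the helper at all.

#include <immintrin.h>

// Hypothetical helper: static inline lets gcc drop the out-of-line copy.
static inline void square4(const double *phi, double *out){
    __m256d var = _mm256_load_pd(phi);              // phi must still be 32-byte aligned
    _mm256_storeu_pd(out, _mm256_mul_pd(var, var));
}

void caller(const double *phi, double *out){
    square4(phi, out);   // with -O2 this call is inlined; no separate square4 is emitted
}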
Upvotes: 3
Reputation: 12175
_mm256_store_pd requires that the memory address you are storing to be aligned to a 32-byte boundary. However, in C I think the standard alignment for an 8-byte double is only an 8-byte boundary.
If I had to guess, when the function is not inlined it starts the localvar array on a 32-byte boundary. I am not sure if this is a guarantee or just luck; I am guessing luck, because inlining a function in theory should not change anything. The compiler may be pushing just the right number of bytes onto the stack so that it becomes aligned. Also, I see no reason why it would guarantee a 32-byte alignment.
When it is inlined, it acts as if the code were typed right where you are calling the function. Therefore you are only guaranteed that localvar will be 8-byte aligned, not 32-byte aligned. I think the proper solution is to use the aligned attribute, which solves your problem. You could also use the _mm256_storeu_pd intrinsic, which does the same thing without the alignment requirement. From my experience on my Haswell CPU it is just as fast.
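A minimal sketch of the unaligned-store variant (same shape as the function in the question; the return type and the trailing use of localvar are mine so it compiles on its own):

#include <immintrin.h>

static inline double foo_unaligned(double *phi){
    double localvar[4];                        // only 8-byte alignment is guaranteed here
    __m256d var = _mm256_load_pd(phi);         // the load still assumes phi is 32-byte aligned
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_storeu_pd(localvar, res);           // unaligned store: safe for any alignment
    return localvar[0];                        // illustrative use of the buffer
}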
Upvotes: 2