Henrik.H
Henrik.H

Reputation: 13

Local variable not aligned in inline function

While programming with Intrinsics the following issue came up. When I want to load or store a local variable, in an inlined function then I got memory violation error, but only if the function is inlined. I have no idea why in inline function the stack variables are not aligned.

I have tested that with many different versions of GCC 4.9, 5.3, 6.1.

Example that failed:

static inline foo(double *phi){
  double localvar[4];
  __m256d var = _mm256_load_pd (phi);
  __m256d res = _mm256_mul_pd(var, var);
  _mm256_store_pd (localvar, res); // <-  failed due to memory violation
  ...
}

If I add __attribute__ ((aligned (32))) or remove inline then the function works like it should.

So can someone explain me (please in detail), why local variables in general are aligned without add __attribute__ ((aligned (32))) and local variables in inline function not?

Upvotes: 0

Views: 990

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 364039

Providing 32-byte alignment costs extra instructions (because the ABI only guarantees 16-byte alignment; just look at the asm for the version with alignas(32) or __attribute__((aligned(32)))). Of course the compiler doesn't do it if you don't ask for it, because it's not free. (See also gcc's -mpreferred-stack-boundary which controls this, and the tag wiki for links to ABI docs).

double localvar[4]; only needs to be 8-byte aligned for each element to be naturally aligned. The SysV x86-64 ABI does guarantee 16-byte alignment for C99 variable-size arrays. I'm not sure if normal compile-time-constant sized arrays get 16-B alignment by default or not.

However, current versions of gcc for some reason align the stack to 32B in a test function that has __m256d local variables. At -O3 it doesn't spill them to the stack, so they're wasted (other than making buggy code like this happen to work). The fact that gcc doesn't remove this stuff is a missed-optimation. (It's needed at -O0 where gcc does spill everything to memory.)

Since my version of your test function (which actually compiles) doesn't have any other locals, the array of doubles is also 32B-aligned. Presumably you're inlining it into a caller that has some other locals, and that leads to different alignment for the array.

Here's the code on the Godbolt compiler explorer:

extern void use_buffer(double*);
// static inline
void no_alignment(const double *phi){
  double localvar[4];
  __m256d var = _mm256_load_pd (phi);
  __m256d res = _mm256_mul_pd(var, var);
  _mm256_storeu_pd (localvar, res);         // use an unaligned store since we didn't request alignment for the buffer
  use_buffer(localvar);
}

    lea     r10, [rsp+8]                 // save old RSP (in a clumsy way)
    and     rsp, -32                     // truncate RSP to the next 32B boundary
    push    QWORD PTR [r10-8]            // save more stuff
    push    rbp
    mov     rbp, rsp
    push    r10
    sub     rsp, 40
    ...         vmovupd YMMWORD PTR [rbp-48], ymm0     ...   // function body
    add     rsp, 40
    pop     r10
    pop     rbp
    lea     rsp, [r10-8]

This is why your code happens to work when it's not inlined. Although it's strange that it doesn't get inlined anyway, even without the inline keyword, unless you compiled without optimization or you didn't use static to let the compiler know that a separate definition wasn't needed.

Upvotes: 3

chasep255
chasep255

Reputation: 12175

_mm256_store_pd requires that the memory address you are storing to must be aligned to a 32 byte boundary. However in C I only think the standard alignment for and 8 byte double is an 8 byte boundary.

If I had to guess when the function is not inlined it starts the localvar array on a 32 byte boundary. I am not sure if this is a guarantee or just luck. I am guessing luck because inlining a function in theory should not change anything. The compiler may be pushing just the right number of bytes onto the stack so that it becomes aligned. Also I see no reason why it would guarantee a 32 byte alignment.

When it is inlined it would act as if the code was just typed where you are calling the function. Therefore you are only guaranteed that localvar will be 8 byte aligned rather than the guaranteed 32 byte alignment. I think the proper solution is to use the aligned attribute which solves your problem. You could also use the _mm256_storeu_pd intrinsic which does the same thing without the alignment requirement. From my experience with my haswell CPU it is just as fast.

Upvotes: 2

Related Questions