Reputation: 13
While programming with intrinsics, the following issue came up: when I load or store a local variable in an inlined function, I get a memory-violation error, but only if the function is inlined. I have no idea why the stack variables are not aligned in the inline function.
I have tested this with several GCC versions: 4.9, 5.3, and 6.1.
Example that failed:
#include <immintrin.h>

static inline void foo(double *phi){
    double localvar[4];
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_store_pd (localvar, res); // <- failed due to memory violation
    ...
}
If I add __attribute__ ((aligned (32))) to localvar, or remove inline, then the function works as it should.
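For reference, a minimal sketch of the working variant with the attribute added (the void return type, the name foo_aligned, and the use_result call are mine, only there to make the snippet self-contained):

#include <immintrin.h>

extern void use_result(double *buf);   // illustrative consumer, not part of the real code

static inline void foo_aligned(double *phi){
    double localvar[4] __attribute__ ((aligned (32)));  // buffer is now 32-byte aligned
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_store_pd (localvar, res);   // no fault: the aligned store's requirement is met
    use_result(localvar);
}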
So can someone explain to me (in detail, please) why local variables are generally aligned without adding __attribute__ ((aligned (32))), but local variables in an inline function are not?
Upvotes: 0
Views: 990
Reputation: 364039
Providing 32-byte alignment costs extra instructions (because the ABI only guarantees 16-byte alignment; just look at the asm for the version with alignas(32) or __attribute__((aligned(32)))). Of course the compiler doesn't do it if you don't ask for it, because it's not free. (See also gcc's -mpreferred-stack-boundary, which controls this, and the x86 tag wiki for links to ABI docs.)
double localvar[4]; only needs to be 8-byte aligned for each element to be naturally aligned. The SysV x86-64 ABI does guarantee 16-byte alignment for C99 variable-size arrays, but I'm not sure whether normal compile-time-constant-sized arrays get 16-byte alignment by default or not.
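If you want to see what alignment a plain double array actually gets in your build, a quick diagnostic (not from the original answer, just a sketch) is to print the address modulo 32:

#include <stdio.h>
#include <stdint.h>

void check_alignment(void){
    double localvar[4];
    // 0 means the array happens to start on a 32-byte boundary;
    // 8, 16 or 24 means an aligned 32-byte store to it would fault.
    printf("localvar %% 32 = %u\n", (unsigned)((uintptr_t)localvar % 32));
}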
However, current versions of gcc for some reason align the stack to 32B in a test function that has __m256d local variables. At -O3 it doesn't spill them to the stack, so the alignment instructions are wasted (other than making buggy code like this happen to work). The fact that gcc doesn't remove this stuff is a missed optimization. (It's needed at -O0, where gcc does spill everything to memory.)
Since my version of your test function (which actually compiles) doesn't have any other locals, the array of doubles is also 32B-aligned. Presumably you're inlining it into a caller that has some other locals, and that leads to different alignment for the array.
Here's the code on the Godbolt compiler explorer:
#include <immintrin.h>

extern void use_buffer(double*);

// static inline
void no_alignment(const double *phi){
    double localvar[4];
    __m256d var = _mm256_load_pd (phi);
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_storeu_pd (localvar, res); // use an unaligned store since we didn't request alignment for the buffer
    use_buffer(localvar);
}
lea r10, [rsp+8] // save old RSP (in a clumsy way)
and rsp, -32 // truncate RSP to the next 32B boundary
push QWORD PTR [r10-8] // save more stuff
push rbp
mov rbp, rsp
push r10
sub rsp, 40
... vmovupd YMMWORD PTR [rbp-48], ymm0 ... // function body
add rsp, 40
pop r10
pop rbp
lea rsp, [r10-8]
This is why your code happens to work when it's not inlined. Although it's strange that it doesn't get inlined anyway, even without the inline keyword, unless you compiled without optimization or you didn't use static to let the compiler know that a separate definition wasn't needed.
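As an illustration of that last point, a hypothetical helper-plus-caller sketch (the names and the -O2 remark are mine): with static inline and optimization enabled, gcc inlines the call and emits no standalone definition of the helper at all.

#include <immintrin.h>

// Hypothetical helper: static inline lets gcc drop the out-of-line copy.
static inline void square4(const double *phi, double *out){
    __m256d var = _mm256_load_pd(phi);              // phi must still be 32-byte aligned
    _mm256_storeu_pd(out, _mm256_mul_pd(var, var));
}

void caller(const double *phi, double *out){
    square4(phi, out);   // with -O2 this call is inlined; no separate square4 is emitted
}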
Upvotes: 3
Reputation: 12175
_mm256_store_pd requires that the memory address you are storing to be aligned to a 32-byte boundary. However, in C I think the standard alignment for an 8-byte double is only an 8-byte boundary.
If I had to guess, when the function is not inlined it starts the localvar array on a 32-byte boundary. I am not sure if this is a guarantee or just luck; I am guessing luck, because inlining a function in theory should not change anything. The compiler may be pushing just the right number of bytes onto the stack so that it becomes aligned. Also, I see no reason why it would guarantee a 32-byte alignment.
When it is inlined, it acts as if the code were typed right where you are calling the function. Therefore you are only guaranteed that localvar will be 8-byte aligned, not 32-byte aligned. I think the proper solution is to use the aligned attribute, which solves your problem. You could also use the _mm256_storeu_pd intrinsic, which does the same thing without the alignment requirement. From my experience on my Haswell CPU it is just as fast.
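A minimal sketch of the unaligned-store variant (same shape as the function in the question; the return type and the trailing use of localvar are mine so it compiles on its own):

#include <immintrin.h>

static inline double foo_unaligned(double *phi){
    double localvar[4];                        // only 8-byte alignment is guaranteed here
    __m256d var = _mm256_load_pd(phi);         // the load still assumes phi is 32-byte aligned
    __m256d res = _mm256_mul_pd(var, var);
    _mm256_storeu_pd(localvar, res);           // unaligned store: safe for any alignment
    return localvar[0];                        // illustrative use of the buffer
}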
Upvotes: 2