Reputation: 320371
I recently investigated a segfault in a piece of software compiled with GCC 8. The code looked as follows (this is just a sketch):
struct Point
{
    int64_t x, y;
};

struct Edge
{
    // some other fields
    // ...
    Point p; // <- at offset `0xC0`
    Edge(const Point &p) : p(p) {}
};

Edge *create_edge(const Point &p)
{
    void *raw_memory = my_custom_allocator(sizeof(Edge));
    return new (raw_memory) Edge(p);
}
The key point here is that my_custom_allocator() returns pointers to unaligned memory. The code crashes because, in order to copy the original point p into the field Edge::p of the new object, the compiler used a movdqu/movaps pair in the [inlined] constructor code:
movdqu 0x0(%rbp), %xmm1 ; read the original object at `rbp`
...
movaps %xmm1, 0xc0(%rbx) ; store it into the new `Edge` object at `rbx` - crash!
At first, everything seems to be clear here: the memory is not properly aligned, movaps crashes. My fault.
But is it?
Attempting to reproduce the problem on Godbolt, I observe that GCC 8 actually attempts to handle it fairly intelligently. When it is sure that the memory is properly aligned, it uses movaps, just like in my code. This
#include <new>
#include <cstdlib>
struct P { unsigned long long x, y; };
unsigned char buffer[sizeof(P) * 100];
void *alloc()
{
return buffer;
}
void foo(const P& s)
{
void *raw = alloc();
new (raw) P(s);
}
results in this
foo(P const&):
movdqu xmm0, XMMWORD PTR [rsi]
movaps XMMWORD PTR buffer[rip], xmm0
ret
But when it is not sure, it uses movups. E.g. if I "hide" the definition of the allocator in the above example, it will opt for movups in the same code
foo(P const&):
push rbx
mov rbx, rdi
call alloc()
movdqu xmm0, XMMWORD PTR [rbx]
movups XMMWORD PTR [rax], xmm0
pop rbx
ret
So, if it is supposed to behave that way, why is it using movaps in the software I mentioned at the beginning of this post? In my case the implementation of my_custom_allocator() is not visible to the compiler at the point of the call, which is why I'd expect GCC to opt for movups.
What are the other factors that might be at play here? Is it a bug in GCC? How can I force GCC to use movups, preferably everywhere?
Upvotes: 4
Views: 6148
Reputation: 363980
Update: alignof(Edge) was 16 because of long double on x86-64 System V, so it's UB to have one at a less-aligned address. This tells GCC it's safe to use movaps.
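A minimal sketch of how a single long double member raises the alignment of the whole struct. The real Edge is not shown in the question, so EdgeLike and its field name weight are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Point
{
    std::int64_t x, y;
};

// Hypothetical stand-in for the real Edge. The field name `weight` is an
// assumption; the question only says a long double field was present.
struct EdgeLike
{
    long double weight;
    Point p;
    explicit EdgeLike(const Point &p) : weight(0), p(p) {}
};

// A struct is at least as strictly aligned as its most-aligned member.
// On x86-64 System V alignof(long double) is 16, so alignof(EdgeLike) is 16
// there, even though Point on its own only needs the alignment of int64_t.
static_assert(alignof(EdgeLike) >= alignof(long double),
              "struct alignment covers the long double member");
static_assert(alignof(EdgeLike) >= alignof(Point),
              "struct alignment covers the Point member");
```

Putting a static_assert on alignof next to a custom allocator is a cheap way to notice when a new field silently raises the requirement.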
IDK why loading it from (%rbp) didn't also use movaps. I thought that implied Edge wouldn't be 16-byte aligned, so there's a whole section of this answer based on that guess (which I moved to the end).
Because of long double, alignof(max_align_t) == 16 on x86-64 System V. A drop-in replacement for malloc needs to return memory at least that aligned, for allocations of 16 bytes or larger.
(Smaller allocations of course couldn't hold a 16-byte object and therefore can't require 16-byte alignment. You can ask for a specific instance of an object to be over-aligned with alignas(16) int foo;, but if a type itself has higher alignment it also has larger sizeof, so an array will still obey the normal rules as well as having every element satisfy the alignment requirement.)
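As a rough illustration of what "at least that aligned" means for a custom allocator, here is a sketch of a bump allocator that rounds every block up to alignof(std::max_align_t). The arena, its size, and the function names are hypothetical, not the allocator from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size arena; a real allocator would manage its own memory.
alignas(std::max_align_t) unsigned char arena[4096];
std::size_t arena_used = 0;

// Round `value` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t round_up(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Bump allocator: every returned block starts at a multiple of
// alignof(std::max_align_t), which is what a malloc replacement must provide.
// Returns nullptr when the arena is exhausted.
void *aligned_bump_alloc(std::size_t size)
{
    const std::size_t start = round_up(arena_used, alignof(std::max_align_t));
    if (size > sizeof(arena) - start)
        return nullptr;
    arena_used = start + size;
    return arena + start;
}
```

This always pads to the maximum fundamental alignment for simplicity; an allocator that takes the required alignment as a parameter would waste less space.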
See also Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where auto-vectorization with a misaligned uint16_t* leads to a segfault. Also Pascal Cuoq's blog about alignment, about how having objects with less alignment than alignof(T) is undefined behaviour, and how the assumption of no UB runs deep for compilers.
GCC and clang use movaps whenever they can prove that memory must be sufficiently aligned (by assuming no UB). On Core2 and earlier, and K10 and earlier, unaligned store instructions are slow even if the memory happens to be aligned at runtime.
Nehalem and Bulldozer changed that, but GCC still uses movaps even with -mtune=haswell, or even vmovaps with -march=haswell even though that can only execute on CPUs with cheap vmovups.
MSVC and ICC never use movaps, hurting perf on very old CPUs but letting you get away with misaligning data sometimes. They will fold aligned loads into memory operands for SSE instructions like paddd xmm0, [rdi] (which requires alignment, unlike the AVX1 equivalent) so they will still make code that faults on misalignment sometimes, but usually only with optimization enabled. IMO that's not particularly great.
alignof(Point) should only be 8 (inheriting the alignment of its most-aligned member, an int64_t). So GCC can only prove 8-byte alignment for an arbitrary Point, not 16.
For static storage, GCC can know that it chose to align the array by 16 and thus can use movaps / movdqa to load from it. (Also, the x86-64 System V ABI requires that static arrays of 16 bytes or larger be aligned by 16, so GCC can assume this even for an extern unsigned char buffer[] global defined in some other compilation unit.)
You haven't shown a definition for Edge so IDK why it has 16-byte alignment, but possibly alignof(Edge) == 16? Otherwise yes, that might be a compiler bug.
But the fact that it loads the original Edge object from the stack with movups would seem to indicate that alignof(Edge) < 16.
Possibly raw_memory = __builtin_assume_aligned(raw_memory, 8); could help? IDK if that can tell GCC to assume lower alignment than it already thought it could assume based on other factors.
You could tell GCC that Edge (or int for that matter) can always be under-aligned by defining a typedef like this:
typedef long __attribute__((aligned(1), may_alias)) unaligned_aliasing_long;
may_alias is actually orthogonal to alignment, but it's worth mentioning because one of the use-cases for this would be loads out of a char[] buffer for parsing a byte stream. In that case you'd want both. That's an alternative to using memcpy(&tmp, src, sizeof(tmp)); to do unaligned strict-aliasing-safe loads.
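The two approaches can be compared side by side in a short sketch. The typedef relies on GNU C attributes (GCC and Clang); the type and function names here are made up for illustration.

```cpp
#include <cassert>
#include <cstring>

// GNU C extension: an integer type that may live at any address (aligned(1))
// and may alias objects of other types (may_alias). Hypothetical name.
typedef unsigned long long __attribute__((aligned(1), may_alias)) unaligned_u64;

// Portable approach: memcpy into a properly aligned temporary. Compilers
// typically turn this into a single unaligned load on x86-64.
inline unsigned long long load_u64_memcpy(const unsigned char *src)
{
    unsigned long long tmp;
    std::memcpy(&tmp, src, sizeof(tmp));
    return tmp;
}

// GNU-specific approach: dereference through the under-aligned aliasing type.
inline unsigned long long load_u64_typedef(const unsigned char *src)
{
    return *reinterpret_cast<const unaligned_u64 *>(src);
}
```

Both functions read 8 bytes from an arbitrary byte offset; the memcpy version is the one to prefer when portability beyond GNU-compatible compilers matters.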
GCC uses may_alias to define __m128, and may_alias,aligned(1) as part of defining _mm_loadu_ps (the intrinsic for unaligned SIMD loads like movups). (You don't need may_alias for loading a vector of float from a float array, but you do need may_alias for loading it from something else.) See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
And see Why does glibc's strlen need to be so complicated to run quickly? for scalar code that I think is safe for under-aligned / aliasing unsigned long, unlike glibc's fallback C implementation. (Which has to be compiled without -flto so it can't inline into other glibc functions and break because of strict-aliasing violation.)
(This section was written assuming that alignof(Edge) < 16. This was not the case here, and the function attributes might be useful to know about even though they're not the cause of the problem. And probably not a viable workaround either.)
You might be able to use __attribute__ ((assume_aligned (8))) on your allocator to tell GCC about the alignment of the pointer it returns.
GCC may possibly be assuming for some reason that your allocator returns memory usable for any object (and alignof(max_align_t) == 16 on x86-64 System V because of long double and other things, and also on Windows x64).
If this is not the case, you may be able to tell it that. In this mmap mis-alignment Q&A, we can see that GCC does "know about" malloc and treat it specially. But if your function doesn't have an ISO C or C++ defined name, or GNU C attributes, that would be surprising. IDK, it's the best guess so far based on what you've shown, if it's not a compiler bug. (That is possible.)
From the GCC manual:
void* my_alloc1 (size_t) __attribute__((assume_aligned (16)));
void* my_alloc2 (size_t) __attribute__((assume_aligned (32, 8)));
declares that my_alloc1 returns 16-byte aligned pointers and that my_alloc2 returns a pointer whose value modulo 32 is equal to 8.
I don't know why it would assume that a void* returned by a function and cast to another type would have any more alignment than the type of the object being constructed, though. We can see that it uses movups to load an Edge from somewhere. That would seem to indicate that alignof(Edge) < 16.
Also relevant is __attribute__((alloc_size(1))) to tell GCC that the first arg to the function is a size. If your function takes an explicit alignment as an arg, use alloc_align (position) to indicate that, otherwise don't.
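For reference, here is a sketch of an allocator declaration carrying these attributes. The function name my_alloc16 is hypothetical, and posix_memalign (POSIX, not ISO C++) is used only so that the promise made by assume_aligned is actually kept. Such a promise can only raise what GCC assumes about the returned pointer, so it is unlikely to help when the object type itself already requires 16-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h> // posix_memalign and free (POSIX)

// Hypothetical allocator. alloc_size(1) says the first argument is the block
// size; assume_aligned(16) promises that the returned pointer is 16-byte aligned.
__attribute__((alloc_size(1), assume_aligned(16)))
void *my_alloc16(std::size_t size);

void *my_alloc16(std::size_t size)
{
    void *ptr = nullptr;
    // posix_memalign returns 0 on success and leaves the block in `ptr`.
    if (posix_memalign(&ptr, 16, size) != 0)
        return nullptr;
    return ptr; // release with free()
}
```

If the implementation ever returned a pointer that is not 16-byte aligned, the attribute itself would become a source of undefined behaviour, so the attribute and the implementation have to be changed together.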
Upvotes: 8
Reputation: 320371
As is correctly stated by other participants in the already posted answers, the triggering factor is the alignment requirements of my data type. The specific culprit turned out to be a long double data field also present in my struct, which slipped my attention initially. This long double data field forced the alignment requirement of the entire struct to become 16.
Again, formally, there's no room for debate here: violating this alignment requirement leads to undefined behavior. End of story.
But practically (referring to the implementation-specific behavior of GCC), this does not appear to be as clear cut though. There still is a strange peculiarity in GCC's behavior here.
Above, in my original question you can see an example of a struct with an alignment requirement of 8 (assume it has no long double fields in it). With this data type GCC behaves as I already described above:
1. If the alignment of raw_pointer is obvious to the compiler and it is known to be 16 or greater, GCC generates movaps instructions.
2. If the alignment of raw_pointer is obvious to the compiler and it is known to be less than 16, GCC generates movups instructions.
3. If the alignment of raw_pointer is not obvious to the compiler, it generates movups instructions.
So, in this case GCC plays it safe, it behaves permissively/defensively. Even if data is not aligned, in practice the code will operate "as expected". (Maybe I'm missing something and it is possible to make it GPF with 8-aligned data as well, but for what it's worth, I haven't encountered it yet.)
But once we jump to a 16-aligned struct (say, by adding a long double field), the GCC logic changes into the following:
1. If the alignment of raw_pointer is obvious to the compiler and it is known to be 16 or greater, GCC generates movaps instructions.
2. If the alignment of raw_pointer is obvious to the compiler and it is known to be less than 16, GCC generates movups instructions.
3. If the alignment of raw_pointer is not obvious to the compiler, it generates movaps instructions (yes, movaps!)
Note the third point: this little detail is what caused the GPF in the aforementioned project. Here's a small example of the same crash: http://coliru.stacked-crooked.com/a/c5cd2be91ebba41e . (BTW, Clang appears to be even more strict in this regard. 16-aligned data? Use movaps, even if the pointer is "obviously" unaligned.)
Looking at situations 1 and 2, it appears that with 16-aligned data GCC also kinda intended to behave permissively/defensively, just like it does with 8-aligned data. But for some reason for situation 3 it chooses to go with movaps instead of movups. Why the inconsistency with the 8-aligned decision process?
Again, obviously, "the behavior is undefined, it is your fault". But the above inconsistency between decisions made for 8-aligned and 16-aligned data strikes me as a little strange. If this is intentional, it would at least be useful to have an option to have GCC handle 16-aligned data in the same way as it handles 8-aligned data, i.e. use movups when things are not entirely transparent.
On second thought, there's really no "inconsistency" here. The logic is solid: with 8-aligned data GCC cannot assume universal applicability of movaps, so it has to act defensively even if the data is perfectly 8-aligned. With 16-aligned data GCC can formally deduce applicability of movaps in all cases, so it does not have to act defensively.
As a quick workaround for those who can't or don't want to 16-align their structs for some reason (memory savings, legacy projects etc.): declaring long double fields as packed "kills" their alignment requirement. If by doing so you successfully reduce the alignment requirement of the struct to 8 or less, the good old permissive GCC behavior will return.
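A small sketch of the packed workaround, using the GNU packed attribute on the long double member only. The struct and field names are hypothetical; the real struct has other fields that are not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Without packing: the long double member dictates the struct's alignment
// (16 on x86-64 System V).
struct AlignedEdge
{
    std::int64_t id;
    long double weight;
};

// With the GNU packed attribute on the member, the long double no longer
// contributes its alignment, so the struct falls back to the alignment of
// its remaining members (that of int64_t here).
struct PackedEdge
{
    std::int64_t id;
    long double weight __attribute__((packed));
};

static_assert(alignof(PackedEdge) <= alignof(AlignedEdge),
              "packing the member must not raise the alignment");
```

The member can still be read and written by value as usual, but GCC refuses to bind a reference to a packed field, which is worth knowing before applying this to a large code base.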
Upvotes: 3
Reputation: 32732
Since the Edge struct has a compiler-determined alignment requirement, the compiler is free to assume that all objects of that type are properly aligned. If your custom allocator does not return a pointer to properly aligned memory, your use of an object at that address results in Undefined Behavior.
Upvotes: 1