Reputation: 320777

`movaps` vs. `movups` in GCC: how does it decide?

I recently researched a segfault in a piece of software compiled with GCC 8. The code looked as follows (this is just a sketch)

struct Point
{
  int64_t x, y;
};

struct Edge
{
  // some other fields
  // ...
  Point p; // <- at offset `0xC0`

  Edge(const Point &p) : p(p) {}
};

Edge *create_edge(const Point &p)
{
  void *raw_memory = my_custom_allocator(sizeof(Edge));
  return new (raw_memory) Edge(p);
}

The key point here is that my_custom_allocator() returns pointers to unaligned memory. The code crashes because in order to copy the original point p into the field Edge::p of the new object the compiler used a movdqu/movaps pair in the [inlined] constructor code

movdqu 0x0(%rbp), %xmm1  ; read the original object at `rbp`
...
movaps %xmm1, 0xc0(%rbx) ; store it into the new `Edge` object at `rbx` - crash!

At first, everything seems to be clear here: the memory is not properly aligned, movaps crashes. My fault.

But is it?

Attempting to reproduce the problem on Godbolt I observe that GCC 8 actually attempts to handle it fairly intelligently. When it is sure that the memory is properly aligned it uses movaps, just like in my code. This

#include <new>
#include <cstdlib>

struct P { unsigned long long x, y; };

unsigned char buffer[sizeof(P) * 100];

void *alloc()
{
  return buffer;
}

void foo(const P& s)
{
  void *raw = alloc();
  new (raw) P(s);
}

results in this

foo(P const&):
    movdqu  xmm0, XMMWORD PTR [rsi]
    movaps  XMMWORD PTR buffer[rip], xmm0
    ret

https://godbolt.org/z/a3uSid

But when it is not sure, it uses movups. E.g. if I "hide" the definition of the allocator in the above example, it will opt for movups in the same code

foo(P const&):
    push    rbx
    mov     rbx, rdi
    call    alloc()
    movdqu  xmm0, XMMWORD PTR [rbx]
    movups  XMMWORD PTR [rax], xmm0
    pop     rbx
    ret

https://godbolt.org/z/cNKe5A

So, if it is supposed to behave that way, why is it using movaps in the software I mentioned at the beginning of this post? In my case the implementation of my_custom_allocator() is not visible to the compiler at the point of the call, which is why I'd expect GCC to opt for movups.

What are the other factors that might be at play here? Is it a bug in GCC? How can I force GCC to use movups, preferably everywhere?

Upvotes: 4

Answers (3)

Peter Cordes

Reputation: 365971

Update: alignof(Edge) was 16 because of long double on x86-64 System V, so it's UB to have one at a less-aligned address. This tells GCC it's safe to use movaps.

IDK why loading it from (%rbp) didn't also use movaps. I thought that implied Edge wouldn't be 16-byte aligned, so there's a whole section of this answer based on that guess (which I moved to the end).

Some types can require 16-byte alignment, notably `long double`

alignof(max_align_t) == 16 on x86-64 System V. A drop-in replacement for malloc needs to return memory at least that aligned, for allocations of 16 bytes or larger.

(Smaller allocations of course couldn't hold a 16-byte object and therefore can't require 16-byte alignment. You can ask for a specific instance of an object to be over-aligned with alignas(16) int foo;, but if a type itself has higher alignment it also has larger sizeof so an array will still obey the normal rules as well as having every element satisfy the alignment requirement.)
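To make the alignment rule concrete, here is a minimal sketch of how a single long double member raises the alignment of the whole struct. EdgeNoLongDouble, EdgeWithLongDouble, the field names and the helper may_assume_16_byte_alignment are hypothetical stand-ins, since the real Edge definition isn't shown. The 8 vs. 16 values in the comments assume x86-64 System V; other ABIs can differ.

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the question's Point: two 8-byte integers.
struct Point {
    std::int64_t x, y;
};

// Hypothetical Edge without a long double member.
// On x86-64 System V its alignment is 8, inherited from int64_t.
struct EdgeNoLongDouble {
    std::int64_t id;
    Point p;
};

// Hypothetical Edge with a long double member.
// On x86-64 System V alignof(long double) == 16, so the struct is 16 too.
struct EdgeWithLongDouble {
    long double weight;
    Point p;
};

// True when every valid object of type T must sit at a 16-byte boundary,
// which is the condition under which a compiler is entitled to pick movaps.
template <typename T>
constexpr bool may_assume_16_byte_alignment() {
    return alignof(T) >= 16;
}
```

Printing alignof(EdgeNoLongDouble) and alignof(EdgeWithLongDouble) on the target in question is a quick way to check whether a struct falls into the "movaps is always allowed" category.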

See also Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? where auto-vectorization with a misaligned uint16_t* leads to a segfault. Also see Pascal Cuoq's blog about alignment, which covers why having objects with less alignment than alignof(T) is undefined behaviour, and how the assumption of no UB runs deep for compilers.

Instruction selection

GCC and clang use movaps whenever they can prove that memory must be sufficiently aligned. (By assuming no UB). On Core2 and earlier, and K10 and earlier, unaligned store instructions are slow even if the memory happens to be aligned at runtime.

Nehalem and Bulldozer changed that, but GCC still uses movaps even with -mtune=haswell, or even vmovaps with -march=haswell even though that can only execute on CPUs with cheap vmovups.

MSVC and ICC never use movaps, hurting perf on very old CPUs but letting you get away with misaligning data sometimes. They will fold aligned loads into memory operands for SSE instructions like paddd xmm0, [rdi] (which requires alignment, unlike the AVX1 equivalent) so they will still make code that faults on misalignment sometimes, but usually only with optimization enabled. IMO that's not particularly great.

alignof(Point) should only be 8 (inheriting the alignment of its most-aligned member, an int64_t). So GCC can only prove 8-byte alignment for an arbitrary Point, not 16.

For static storage, GCC can know that it chose to align the array by 16 and thus can use movaps / movdqa to load from it. (Also, the x86-64 System V ABI requires that static arrays of 16 bytes or larger be aligned by 16, so GCC can assume this even for an extern unsigned char buffer[] global defined in some other compilation unit.)

You haven't shown a definition for Edge so IDK why it has 16-byte alignment, but possibly alignof(Edge) == 16? Otherwise yes, that might be a compiler bug.

But the fact that it loads the original Edge object from the stack with movups would seem to indicate that alignof(Edge) < 16

Possibly raw_memory = __builtin_assume_aligned(raw_memory, 8); could help? IDK if that can tell GCC to assume lower alignment than it already thought it could assume based on other factors.

You could tell GCC that Edge (or int for that matter) can always be under-aligned by defining a typedef like this:

typedef long __attribute__((aligned(1), may_alias)) unaligned_aliasing_long;

may_alias is actually orthogonal to alignment, but it's worth mentioning because one of the use-cases for this would be loads out of a char[] buffer for parsing a byte stream. In that case you'd want both. That's an alternative to using memcpy(tmp, src, sizeof(tmp)); to do unaligned strict-aliasing-safe loads.
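A rough sketch of the two options for reading a 64-bit value out of a byte buffer at an arbitrary offset. The function names load_u64_memcpy and load_u64_typedef and the typedef name unaligned_aliasing_u64 are made up for illustration. The typedef variant relies on GNU C attributes, so it is guarded and only expected to work with GCC and clang.

```cpp
#include <cstdint>
#include <cstring>

// Portable version: memcpy has no alignment or strict-aliasing requirement,
// and compilers typically turn a fixed-size copy like this into a single load.
inline std::uint64_t load_u64_memcpy(const unsigned char *src) {
    std::uint64_t tmp;
    std::memcpy(&tmp, src, sizeof(tmp));
    return tmp;
}

#if defined(__GNUC__)
// GNU C extension: a 64-bit integer type that may live at any address
// (aligned(1)) and may alias objects of other types (may_alias).
typedef unsigned long long __attribute__((aligned(1), may_alias)) unaligned_aliasing_u64;

// Dereferencing through the typedef tells the compiler not to assume
// 8-byte alignment for this particular access.
inline std::uint64_t load_u64_typedef(const unsigned char *src) {
    return *reinterpret_cast<const unaligned_aliasing_u64 *>(src);
}
#endif
```

Both should return the same value for the same bytes. The memcpy form is the one to prefer when portability beyond GNU C matters.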

GCC uses may_alias to define __m128, and may_alias,aligned(1) as part of defining _mm_loadu_ps (the intrinsic for unaligned SIMD loads like movups). (You don't need may_alias for loading a vector of float from a float array, but you do need may_alias for loading it from something else.) See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

And see Why does glibc's strlen need to be so complicated to run quickly? for scalar code that I think is safe for under-aligned / aliasing unsigned long, unlike glibc's fallback C implementation. (Which has to be compiled without -flto so it can't inline into other glibc functions and break because of strict-aliasing violation.)

Allocators and assumed alignment

(This section was written assuming that alignof(Edge) < 16. This was not the case here, but the function attributes might be useful to know about even though they're not the cause of the problem. And probably not a viable workaround either.)

You might be able to use __attribute__ ((assume_aligned (8))) on your allocator to tell GCC about the alignment of the pointer it returns.

GCC may possibly be assuming for some reason that your allocator returns memory usable for any object (and alignof(max_align_t) == 16 on x86-64 System V because of long double and other things, and also on Windows x64).

If this is not the case, you may be able to tell it that. In this mmap mis-alignment Q&A, we can see that GCC does "know about" malloc and treat it specially. But if your function doesn't have an ISO C or C++ defined name, or GNU C attributes, that would be surprising. IDK, it's the best guess so far based on what you've shown, if it's not a compiler bug. (That is possible.)

From the GCC manual:

void* my_alloc1 (size_t) __attribute__((assume_aligned (16)));
void* my_alloc2 (size_t) __attribute__((assume_aligned (32, 8)));
declares that my_alloc1 returns 16-byte aligned pointers and that my_alloc2 returns a pointer whose value modulo 32 is equal to 8.

I don't know why it would assume that a void* returned by a function and cast to another type would have any more alignment than the type of the object being constructed, though. We can see that it uses movups to load an Edge from somewhere. That would seem to indicate that alignof(Edge) < 16.

Also relevant is __attribute__((alloc_size(1))) to tell GCC that the first arg to the function is a size. If your function takes an explicit alignment as an arg, use alloc_align (position) to indicate that, otherwise don't.
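For illustration, a minimal sketch of an allocator declared with these attributes. pool_alloc, POOL_ALLOC_ATTRS and the fixed-size pool are hypothetical, not the question's my_custom_allocator. Note that assume_aligned is a promise the allocator has to keep, which is why this toy bump allocator rounds every request up to a multiple of 16. It isn't a way to make GCC assume less alignment than the object type requires.

```cpp
#include <cstddef>

#if defined(__GNUC__)
// GNU C attributes: the result is 16-byte aligned, and argument 1 is its size.
#define POOL_ALLOC_ATTRS __attribute__((assume_aligned(16), alloc_size(1)))
#else
#define POOL_ALLOC_ATTRS
#endif

// The attributes go on the declaration that callers see.
void *pool_alloc(std::size_t size) POOL_ALLOC_ATTRS;

namespace {
// Backing storage for the sketch, itself aligned to 16 bytes.
alignas(16) unsigned char pool[4096];
std::size_t pool_used = 0;
}  // namespace

// Toy bump allocator: hands out 16-byte-aligned chunks and never frees.
void *pool_alloc(std::size_t size) {
    // Round up to a multiple of 16 so the next chunk stays aligned.
    const std::size_t rounded = (size + 15) & ~static_cast<std::size_t>(15);
    if (rounded < size || rounded > sizeof(pool) - pool_used) {
        return nullptr;  // overflow while rounding, or pool exhausted
    }
    void *result = pool + pool_used;
    pool_used += rounded;
    return result;
}
```

If the allocator ever returned a pointer that breaks the declared alignment, the attribute would make things worse rather than better, because callers are compiled on the assumption that the promise holds.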

Upvotes: 8

AnT stands with Russia

Reputation: 320777

As is correctly stated by other participants in the already posted answers, the triggering factor is the alignment requirements of my data type. The specific culprit turned out to be a long double data field also present in my struct, which slipped my attention initially. This long double data field forced the alignment requirement of the entire struct to become 16.

Again, formally, there's no room for debate here: violating this alignment requirement leads to undefined behavior. End of story.

But practically (referring to the implementation-specific behavior of GCC), this does not appear to be as clear cut though. There still is a strange peculiarity in GCC's behavior here.

Above, in my original question you can see an example of a struct with an alignment requirement of 8 (assume it has no long double fields in it). With this data type GCC behaves as I already described above:

1. When alignment of the raw_pointer is obvious to the compiler and it is known to be 16 or greater, GCC generates movaps instructions.
2. When alignment of the raw_pointer is obvious to the compiler and it is known to be less than 16, GCC generates movups instructions.
3. When alignment of the raw_pointer is not obvious to the compiler, it generates movups instructions.

So, in this case GCC plays it safe: it behaves permissively/defensively. Even if data is not aligned, in practice the code will operate "as expected". (Maybe I'm missing something and it is possible to make it GPF with 8-aligned data as well, but for what it's worth, I haven't encountered it yet.)

But once we jump to a 16-aligned struct (say, by adding a long double field), the GCC logic changes into the following:

1. When alignment of the raw_pointer is obvious to the compiler and it is known to be 16 or greater, GCC generates movaps instructions.
2. When alignment of the raw_pointer is obvious to the compiler and it is known to be less than 16, GCC generates movups instructions.
3. When alignment of the raw_pointer is not obvious to the compiler, it generates movaps instructions (yes, movaps!)

Note the third point: this little detail is what caused the GPF in the aforementioned project. Here's a small example of the same crash: http://coliru.stacked-crooked.com/a/c5cd2be91ebba41e . (BTW, Clang appears to be even more strict in this regard. 16-aligned data? Use movaps, even if the pointer is "obviously" unaligned.)

Looking at situations 1 and 2, it appears that with 16-aligned data GCC also kinda intended to behave permissively/defensively, just like it does with 8-aligned data. But for some reason for situation 3 it chooses to go with movaps instead of movups. Why the inconsistency with the 8-aligned decision process?

Again, obviously, "the behavior is undefined, it is your fault". But the above inconsistency between decisions made for 8-aligned and 16-aligned data strikes me as a little strange. If this is intentional, it would at least be useful to have an option to have GCC handle 16-aligned data in the same way as it handles 8-aligned data, i.e. use movups when things are not entirely transparent.

On second thought, there's really no "inconsistency" here. The logic is solid: with 8-aligned data GCC cannot assume universal applicability of movaps, so it has to act defensively even if the data is perfectly 8-aligned. With 16-aligned data GCC can formally deduce applicability of movaps in all cases, so it does not have to act defensively.

As a quick workaround for those who can't or don't want to 16-align their structs for some reason (memory savings, legacy projects etc.): declaring long double fields as packed "kills" their alignment requirement. If by doing so you successfully reduce the alignment requirement of the struct to 8 or less, the good old permissive GCC behavior will return.
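A small sketch of what that workaround looks like, using GNU C attribute syntax (GCC and clang). EdgeAligned, EdgePacked and the weight field are hypothetical names, since the real struct isn't shown here.

```cpp
#include <cstdint>

// Mirrors the question's Point: alignment comes from int64_t.
struct Point {
    std::int64_t x, y;
};

// The long double member drags the whole struct up to its own alignment
// (16 on x86-64 System V).
struct EdgeAligned {
    long double weight;
    Point p;
};

// Marking only the long double member as packed drops that member's
// alignment requirement to 1, so the struct falls back to alignof(Point).
struct EdgePacked {
    long double weight __attribute__((packed));
    Point p;
};
```

One caveat worth keeping in mind: a packed member no longer has the alignment its type normally guarantees, so binding a pointer or reference to it is risky. Recent GCC versions warn about this with -Waddress-of-packed-member.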

Upvotes: 3

1201ProgramAlarm

Reputation: 32727

Since the Edge struct has a compiler-determined alignment requirement, the compiler is free to assume that all objects of that type are properly aligned. If your custom allocator does not return a pointer to properly aligned memory, your use of an object at that address results in Undefined Behavior.
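If the allocator itself can't be changed, one way to stay within the rules is to over-allocate and adjust the pointer before constructing the object. The following is only a sketch of that idea using std::align. Edge16, construct_aligned and is_aligned_for are hypothetical names, and the caller is assumed to request at least sizeof(T) + alignof(T) - 1 bytes from the underlying allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical 16-byte-aligned type, standing in for the question's Edge.
struct Edge16 {
    long double weight;
    long long x, y;
};

// Reports whether p satisfies the alignment requirement of T.
template <typename T>
bool is_aligned_for(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Copy-constructs `value` at the first suitably aligned address inside
// [raw, raw + space). Returns nullptr if the object does not fit there.
template <typename T>
T *construct_aligned(void *raw, std::size_t space, const T &value) {
    void *p = raw;
    // std::align bumps p forward to the next alignof(T) boundary and
    // shrinks space accordingly, or returns nullptr if sizeof(T) won't fit.
    if (std::align(alignof(T), sizeof(T), p, space) == nullptr) {
        return nullptr;
    }
    return new (p) T(value);
}
```

The original raw pointer still has to be kept somewhere for deallocation, because the adjusted pointer is generally not the one the allocator handed out.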

Upvotes: 1
