Reputation: 11
my issue concerns deriving an unaligned __m512 pointer to a memory space containing floats. I find that GCC and Clang are somewhat unstable in generating the correct uop (unaligned vs aligned) when accessing memory through such a contraption.
First, the working case:
typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
MyFloatVector* vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);
Both Clang and GCC generate MOVUPS for the above. However, if the type for vec_ptr is left for the compiler:
typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
auto vec_ptr = reinterpret_cast<MyFloatVector *>(float_ptr);
Something(*vec_ptr);
Now, Clang will generate MOVAPS and a segfault down the line. GCC will still generate MOVUPS, but also three do-nothing instructions (push rbp, load rsp to rbp, pop rbp).
Also, if I change from typedef to using:
using MyFloatVector = float __attribute__((vector_size(64), aligned(4)));
MyFloatVector*vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);
Again GCC generates the fluff instructions and Clang generates MOVAPS. Using auto here gives the same result.
So, does anyone have any idea what's happening under the hood, and is there a safe way to do the conversion. While there exists a working solution, IMO the discrepancies generated by typedef/using and explicit/auto make it far too unreliable to use with confidence--at the minimum I'd need a static assert to check that the generated uop when dereferencing the pointer is unaligned, which doesn't exist AFAIK.
In some cases I might want to have a MyFloatVector-reference to the memory area, which rules out using intrinsics.
Sample code: https://godbolt.org/z/caxScz. Includes ICC for "fun", which generates MOVUPS throughout.
Upvotes: 1
Views: 781
Reputation: 17492
When you use reinterpret_cast
you're telling the compiler that the argument points to a valid object of the requested type. That means that it has the same alignment requirements.
ICC is being more conservative here, while clang and GCC are trying to make your code go faster by assuming that you're actually adhering to the standard.
Keep in mind that the aligned attribute can only be used to increase alignment requirements, not to decrease them, so in your code you're just saying that the types have a minimum alignment of 4 bytes. If you add a static_assert(alignof(MyFloatVector) == 4, "Alignment should be 4")
you'll probably see some failures, depending on how exactly you declare it.
Since you're not using __m512
, _mm512_loadu_ps
would work but probably isn't really the right way to go IMHO. The correct way to load unaligned data is to use memcpy
(or __builtin_memcpy
, since you're using vector extensions anyways). Compilers are really good about optimizing memcpy with known sizes, as long as you're using a relatively recent compiler you should end up with a vmovups on x86 with AVX-512F enabled.
Upvotes: 2