Reputation: 16747
It is known that GCC/Clang auto-vectorize loops well using SIMD instructions.
It is also known that there exists the standard C++ alignas() specifier, which among other uses allows aligning stack variables, for example in the following code:
#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);
    std::cout << uint64_t(&x) % 1024 << " "
              << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
              << uint64_t(&y) % 16384 << std::endl;
}
Outputs:
0 9216
0 9216
which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
Let's now look at another piece of code:
#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
If compiled with the -std=c++20 -O3 -mavx512f flags on GCC, it produces the following asm code (excerpt):
vmovdqu64 zmm1, ZMMWORD PTR [rdi]
vpxorq zmm0, zmm1, ZMMWORD PTR [rsi]
vmovdqu64 ZMMWORD PTR [rdi], zmm0
vmovdqu64 zmm0, ZMMWORD PTR [rsi+64]
vpxorq zmm0, zmm0, ZMMWORD PTR [rdi+64]
vmovdqu64 ZMMWORD PTR [rdi+64], zmm0
which twice performs an AVX-512 unaligned load + xor + unaligned store. So we can see that our 64-bit array-xor operation was auto-vectorized by GCC using AVX-512 registers, and the loop was unrolled too.
My question is how to tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) as in the code above, I can force GCC to use the aligned load (vmovdqa64). It is known that aligned loads/stores can be considerably faster.
My first try at forcing GCC to do aligned loads/stores was the following code:
#include <cstdint>

void g(uint64_t (&x_)[16],
       uint64_t const (&y_)[16]) {
    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
but this code still produces unaligned loads (vmovdqu64), the same as the asm of the previous snippet. Hence this alignas(64) hint gives GCC nothing useful to improve the generated assembly.
My question is: how do I force GCC to emit aligned auto-vectorized code, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for all operations?
If possible, I need solutions for all of GCC/Clang/MSVC.
Upvotes: 2
Views: 1319
Reputation: 13679
As I infer from your own answer, you're interested in an MSVC solution too.
MSVC understands the proper use of alignas as well as its own __declspec(align), and it also understands __builtin_assume_aligned, but it intentionally does not do anything with known alignment.
My report was closed as "Duplicate":
Related reports were closed as "Not a bug":
MSVC still takes advantage of the alignment of global variables, if it can observe that the pointer points to the global variable. Even this does not work in every case.
Upvotes: 1
Reputation: 1638
Though not entirely portable across compilers, __builtin_assume_aligned will tell GCC to assume the pointers are aligned.
I often use a different, more portable strategy based on a helper struct:
#include <array>
#include <cstddef>
#include <cstdint>

template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;
    std::array<uint64_t,size> v;

    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }

    bool get_bit(size_t c) const { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c) { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c) { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c) { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }

    size_t hammingweight() const { size_t w = 0; for (size_t i = 0; i < size; ++i) w += mccl::hammingweight(v[i]); return w; }
    bool parity() const { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return mccl::hammingweight(x)%2; }
};
and then convert the uint64_t pointer to a pointer to this struct using reinterpret_cast.
Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
Upvotes: 1
Reputation: 16747
Just now @MarcStevens suggested a working solution to my question, using __builtin_assume_aligned:
#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
It actually produces code with the aligned vmovdqa64 instruction.
But only GCC produces the aligned instruction; Clang still uses unaligned ones (see here), and Clang uses AVX-512 registers only with more than 16 elements.
So Clang and MSVC solutions are still welcome.
Upvotes: 1