What does the 32bit memory alignment constraint mean for AVX?

Question

The documentation of _mm256_load_ps states that the memory must be 32bit-aligned in order to load the values into the registers.

So I found a post that explains what it means for an address to be 32bit aligned.

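#include <immintrin.h>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t height = 8, width = 8; // example dimensions; not given in the question
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // 0x55e960270eb0
    __m256 a_row = _mm256_load_ps(A.data());
    return 0; // Exit Code 139 SIGSEGV
}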

So I tried that code and expected it to work. I checked the address:
0x55e960270eb0 % 4 = 0, and floats are 4 bytes in size.
I am completely baffled by the crash. If I use a raw array allocated with _mm_malloc, suddenly everything works:

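#include <immintrin.h>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t height = 8, width = 8; // example dimensions; not given in the question
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // &A = 0x55e960270eb0

    float* m = static_cast<float*>(_mm_malloc(A.size() * sizeof(float), 32));
    std::cout << "m* = " << m << std::endl; // m* = 0x562bbe989700

    __m256 a_row = _mm256_load_ps(m);

    _mm_free(m); // memory from _mm_malloc must be released with _mm_free, not delete

    return 0; // Returns 0
}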

What am I missing/misinterpreting ?

Arty · Accepted Answer

You misread this - it says 32 BYTE aligned, not 32 BIT.

So you have to do 32-byte alignment instead of 4-byte alignment.

To align any stack variable you can use alignas(32) T var;, where T can be any type, for example std::array.

alignas(...) alone does not align the memory of std::vector or of any other heap-based structure; for that you have to write a special aligning allocator (see the Test() function for an example of usage):

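#include <cstddef>   // std::size_t, std::ptrdiff_t
#include <cstdlib>   // posix_memalign, std::free
#include <new>       // std::bad_alloc, placement new
#include <vector>
#ifdef _MSC_VER
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Following includes for tests only
#include <iostream>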

template <typename T, std::size_t N = 16>
class AlignmentAllocator {
  public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

  public:
    inline AlignmentAllocator() noexcept {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) noexcept {}
    inline ~AlignmentAllocator() noexcept {}
    inline pointer address(reference r) { return &r; }
    inline const_pointer address(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & wert);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const noexcept { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator & other) const { return true; }
};

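template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #if _MSC_VER
        void * p0 = _aligned_malloc(n * sizeof(value_type), N);
        if (!p0) throw std::bad_alloc();
    #else
        void * p0 = nullptr;
        // posix_memalign needs N to be a power of two and a multiple of sizeof(void *)
        int r = posix_memalign(&p0, N, n * sizeof(value_type));
        // An allocator must report failure by throwing, not by returning a null pointer
        if (r != 0) throw std::bad_alloc();
    #endif
    return static_cast<pointer>(p0);
}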
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #if _MSC_VER
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
}

template <typename T, std::size_t N = 16>
using AlignedVector = std::vector<T, AlignmentAllocator<T, N>>;

template <std::size_t Align>
void Test() {
    AlignedVector<float, Align> v(1); // float elements, as in the question
    size_t uptr = size_t(v.data()), alignment = 0;
    while (!(uptr & 1)) {
        ++alignment;
        uptr >>= 1;
    }
    std::cout << "Requested: " << Align << ", Actual: " << (1 << alignment) << std::endl;
}

int main() {
    Test<8>();
    Test<16>();
    Test<32>();
    Test<64>();
    Test<128>();
    Test<256>();
}

Output:

Requested: 8, Actual: 16
Requested: 16, Actual: 16
Requested: 32, Actual: 32
Requested: 64, Actual: 128
Requested: 128, Actual: 8192
Requested: 256, Actual: 256

As you can see in the code above, I used posix_memalign() for Clang/GCC and _aligned_malloc() for MSVC. Starting from C++17 there is also std::aligned_alloc(), but not all compilers implement it; MSVC, at least, does not. On Clang/GCC you can use std::aligned_alloc() instead of posix_memalign(), as commented by @Mgetz.

Also, as the Intel guide says, you can use _mm_malloc() and _mm_free() instead of posix_memalign()/_aligned_malloc()/_aligned_free()/std::aligned_alloc()/std::free().
