Reputation: 63
The documentation of _mm256_load_ps states that the memory must be 32bit-aligned in order to load the values into the registers.
So I found a post that explains how to check whether an address is 32bit-aligned.
#include <immintrin.h>
#include <iostream>
#include <vector>
int main() {
    const int height = 8, width = 8; // example dimensions; not given in the question
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // 0x55e960270eb0
    __m256 a_row = _mm256_load_ps(A.data());
    return 0; // Exit Code 139 SIGSEGV
}
I tried the code above and expected it to work.
I checked the address: 0x55e960270eb0 % 4 = 0, and floats are 4 bytes in size.
I am completely baffled as to why it crashes.
If I use a raw array allocated with _mm_malloc, suddenly everything works:
#include <immintrin.h>
#include <iostream>
#include <vector>
int main() {
    const int height = 8, width = 8; // example dimensions; not given in the question
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // &A = 0x55e960270eb0
    float* m = static_cast<float*>(_mm_malloc(A.size() * sizeof(float), 32));
    std::cout << "m* = " << m << std::endl; // m* = 0x562bbe989700
    __m256 a_row = _mm256_load_ps(m);
    _mm_free(m); // memory from _mm_malloc must be released with _mm_free, not delete
    return 0; // Returns 0
}
What am I missing or misinterpreting?
Upvotes: 1
Views: 419
Reputation: 16747
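You misread this: it says 32-BYTE aligned, not 32-bit.
So you need 32-byte alignment instead of 4-byte alignment. The address 0x55e960270eb0 from your first example is 4-byte aligned, but 0x55e960270eb0 % 32 = 16, so it is not 32-byte aligned and the aligned load faults. The _mm_malloc address 0x562bbe989700 is a multiple of 32, which is why the second example works.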
To align any stack variable you can use alignas(32) T var;, where T can be any type, for example std::array<float, 8>.
To align std::vector's memory or any other heap-based structure, alignas(...) is not enough; you have to write a special aligning allocator (see the Test() function for an example of usage):
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>       // std::bad_alloc
#ifdef _MSC_VER
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif
// The following includes are for the test only
#include <vector>
#include <iostream>

template <typename T, std::size_t N>
class AlignmentAllocator {
public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;
public:
    inline AlignmentAllocator() noexcept {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) noexcept {}
    inline ~AlignmentAllocator() noexcept {}
    inline pointer address(reference r) { return &r; }
    inline const_pointer address(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & value);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const noexcept { return size_type(-1) / sizeof(value_type); }
    // Explicit rebind is required because N is a non-type template parameter
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    // The allocator is stateless, so all instances compare equal
    bool operator==(const AlignmentAllocator<T, N> &) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#ifdef _MSC_VER
    void * p0 = _aligned_malloc(n * sizeof(value_type), N);
    if (p0 == nullptr) throw std::bad_alloc();
#else
    void * p0 = nullptr;
    // posix_memalign requires N to be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&p0, N, n * sizeof(value_type)) != 0) throw std::bad_alloc();
#endif
    return static_cast<pointer>(p0);
}

template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#ifdef _MSC_VER
    _aligned_free(p);
#else
    std::free(p);
#endif
}

template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & value) {
    new (p) value_type(value);
}

template <typename T, std::size_t N = 64>
using AlignedVector = std::vector<T, AlignmentAllocator<T, N>>;

template <std::size_t Align>
void Test() {
    AlignedVector<float, Align> v(1);
    // Count the trailing zero bits of the address to find its actual alignment
    std::size_t uptr = reinterpret_cast<std::size_t>(v.data()), alignment = 0;
    while (!(uptr & 1)) {
        ++alignment;
        uptr >>= 1;
    }
    std::cout << "Requested: " << Align << ", Actual: " << (std::size_t(1) << alignment) << std::endl;
}

int main() {
    Test<8>();
    Test<16>();
    Test<32>();
    Test<64>();
    Test<128>();
    Test<256>();
}
Output:
Requested: 8, Actual: 16
Requested: 16, Actual: 16
Requested: 32, Actual: 32
Requested: 64, Actual: 128
Requested: 128, Actual: 8192
Requested: 256, Actual: 256
In the AlignmentAllocator code above I used posix_memalign() for Clang/GCC and _aligned_malloc() for MSVC. Since C++17 there is also std::aligned_alloc(), but it seems that not all compilers implement it; MSVC, at least, does not. It looks like on Clang/GCC you can use std::aligned_alloc() instead of posix_memalign(), as commented by @Mgetz.
Also, as the Intel guide says, you can use _mm_malloc() and _mm_free() instead of posix_memalign()/_aligned_malloc()/_aligned_free()/std::aligned_alloc()/std::free().
Upvotes: 3