pauldoo

Reputation: 18635

How to instruct compiler to generate unaligned loads for __m128

I've got some code that works with __m128 values. I'm using x86-64 SSE intrinsics on these values and I find that if the values are unaligned in memory I get a crash. This is due to my compiler (clang in this instance) generating only aligned load instructions.

Can I instruct my compiler to generate unaligned loads instead, either globally or for certain values (perhaps with an annotation of some kind)?


The reason I have unaligned values in the first place is that I'm trying to save memory. I have a struct roughly as follows:

#pragma pack(push, 4)
struct Foobar {
    __m128 a;
    __m128 b;
    int c;
};
#pragma pack(pop)

I am then creating an array of these structs. The 2nd element in the array starts at 36 bytes, which is not a multiple of 16.

I know I could switch to a structure of arrays representation, or remove the packing pragma (at the cost of increasing the size of the struct from 36 to 48 bytes); but I also know that unaligned loads aren't that expensive these days and would like to try that first.
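The size and offset arithmetic above can be checked with a small sketch. PackedFoobar, AlignedFoobar, element_offset and element_is_16_aligned are hypothetical names introduced only for illustration; plain float arrays stand in for __m128 so the sketch needs no SSE headers, and it assumes 4-byte float and int as on typical x86-64 targets.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the 4-byte packed struct from the question (assumption:
// float[4] occupies the same 16 bytes as __m128, with 4-byte alignment).
struct PackedFoobar {
    float a[4];
    float b[4];
    int c;
};

// Stand-in for the struct without the packing pragma: each vector member
// keeps its natural 16-byte alignment.
struct AlignedFoobar {
    alignas(16) float a[4];
    alignas(16) float b[4];
    int c;
};

static_assert(sizeof(PackedFoobar) == 36, "16 + 16 + 4 bytes, no padding");
static_assert(sizeof(AlignedFoobar) == 48, "padded up to a multiple of 16");

// Byte offset of element i within an array of T.
template <typename T>
constexpr std::size_t element_offset(std::size_t i) {
    return i * sizeof(T);
}

// True when element i of a 16-byte-aligned array of T is itself 16-byte aligned.
template <typename T>
constexpr bool element_is_16_aligned(std::size_t i) {
    return element_offset<T>(i) % 16 == 0;
}

// Runtime restatement of the numbers quoted in the question.
inline void check_offsets() {
    assert(element_offset<PackedFoobar>(1) == 36);    // not a multiple of 16
    assert(!element_is_16_aligned<PackedFoobar>(1));
    assert(element_is_16_aligned<AlignedFoobar>(1));  // 48 is a multiple of 16
}
```

In the packed layout only every fourth element (offsets 0, 144, 288, and so on) lands on a 16-byte boundary, which is why aligned loads fault on most of the array.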


Update to answer some of the comments below:

My actual code was closer to this:

struct Vector4 {
    __m128 data;
    Vector4(__m128 v) : data(v) {}
};
struct Foobar {
    Vector4 a;
    Vector4 b;
    int c;
};

I then have some utility functions such as:

inline Vector4 add( const Vector4& a, const Vector4 &b ) {
    return Vector4(_mm_add_ps(a.data, b.data));
}

inline Vector4 subtract( const Vector4& a, const Vector4& b ) {
    return Vector4(_mm_sub_ps(a.data, b.data));
}

// etc..

I use these utilities often in combination. Fake example:

Foobar myArray[1000];
myArray[i+1].b = subtract(add(myArray[i].a, myArray[i].b), myArray[i+1].a);

Following "Z boson"'s answer, my code effectively changed into:

struct Vector4 {
    float data[4];
};

inline Vector4 add( const Vector4& a, const Vector4 &b ) {
    Vector4 result;
    _mm_storeu_ps(result.data, _mm_add_ps(_mm_loadu_ps(a.data), _mm_loadu_ps(b.data)));
    return result;
}

My concern was that when the utility functions are combined as above, the generated code might contain redundant load/store instructions. This turned out not to be a problem: I tested with my compiler (clang), and it removed them all. I'll accept Z boson's answer.
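A minimal sketch of the combined form, for anyone who wants to inspect the generated assembly themselves (for example with -O2 -S). The Foobar layout, the step helper and the example function are illustrative assumptions, not the exact code that was tested.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Plain float storage: only 4-byte alignment is required.
struct Vector4 {
    float data[4];
};

inline Vector4 add(const Vector4& a, const Vector4& b) {
    Vector4 result;
    _mm_storeu_ps(result.data,
                  _mm_add_ps(_mm_loadu_ps(a.data), _mm_loadu_ps(b.data)));
    return result;
}

inline Vector4 subtract(const Vector4& a, const Vector4& b) {
    Vector4 result;
    _mm_storeu_ps(result.data,
                  _mm_sub_ps(_mm_loadu_ps(a.data), _mm_loadu_ps(b.data)));
    return result;
}

// 36 bytes per element without any packing pragma, since no member
// needs more than 4-byte alignment.
struct Foobar {
    Vector4 a;
    Vector4 b;
    int c;
};

// Hypothetical helper mirroring the "fake example" above.
inline void step(Foobar* arr, int i) {
    arr[i + 1].b = subtract(add(arr[i].a, arr[i].b), arr[i + 1].a);
}

// Usage example with exactly representable values.
inline void example() {
    Foobar arr[2] = {};
    arr[0].a = Vector4{{1.0f, 2.0f, 3.0f, 4.0f}};
    arr[0].b = Vector4{{10.0f, 20.0f, 30.0f, 40.0f}};
    arr[1].a = Vector4{{0.5f, 0.5f, 0.5f, 0.5f}};
    step(arr, 0);
    assert(arr[1].b.data[0] == 10.5f);  // 1 + 10 - 0.5
    assert(arr[1].b.data[3] == 43.5f);  // 4 + 40 - 0.5
}
```

Whether the intermediate stores and reloads disappear depends on the compiler and optimization level, so it is worth checking the output for your own toolchain.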

Upvotes: 9

Views: 2834

Answers (4)

Giovanni Funchal

Reputation: 9200

Clang has the -fmax-type-align option. If you set -fmax-type-align=8, no instructions that require 16-byte alignment will be generated.

Upvotes: 2

Z boson

Reputation: 33679

In my opinion you should write your data structures using standard C++ constructs (which __m128 is not). When you want to use intrinsics, which are not standard C++, you "enter SSE world" through intrinsics such as _mm_loadu_ps and you "leave SSE world" back to standard C++ with an intrinsic such as _mm_storeu_ps. Don't rely on implicit SSE loads and stores. I have seen too many mistakes on SO from doing this.

In this case you should use

struct Foobar {
    float a[4];
    float b[4];
    int c;
};

then you can do

Foobar foo[16];

In this case foo[1] won't be 16-byte aligned, but when you want to use SSE and leave standard C++, do

__m128 a4 = _mm_loadu_ps(foo[1].a);
__m128 b4 = _mm_loadu_ps(foo[1].b);
__m128 max = _mm_max_ps(a4,b4);
_mm_storeu_ps(array, max);

then go back to standard C++.

Another thing you can consider is this

struct Foobar {
    float a[16];
    float b[16];
    int c[4];
};

then to get an array of 16 of the original struct do

Foobar foo[4];

In this case, as long as the first element is aligned, so are all the other elements.
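A minimal sketch of how this blocked layout might be indexed. FoobarBlock, a_of, b_of, max_ab and example are hypothetical names introduced for illustration; the sketch assumes 4-byte float and int, so each block is 144 bytes (a multiple of 16).

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Blocked layout: one block stores four logical elements.
struct FoobarBlock {
    float a[16];
    float b[16];
    int c[4];
};

// 64 + 64 + 16 = 144 bytes, a multiple of 16, so every block in a
// 16-byte-aligned array starts on a 16-byte boundary.
static_assert(sizeof(FoobarBlock) % 16 == 0, "block size must keep alignment");

// Locate the four floats of `a` / `b` for logical element i.
inline const float* a_of(const FoobarBlock* blocks, int i) {
    return &blocks[i / 4].a[4 * (i % 4)];
}
inline const float* b_of(const FoobarBlock* blocks, int i) {
    return &blocks[i / 4].b[4 * (i % 4)];
}

// Element-wise max of a and b for logical element i.
// The aligned loads are valid only if `blocks` is 16-byte aligned.
inline void max_ab(const FoobarBlock* blocks, int i, float* out) {
    __m128 a4 = _mm_load_ps(a_of(blocks, i));
    __m128 b4 = _mm_load_ps(b_of(blocks, i));
    _mm_storeu_ps(out, _mm_max_ps(a4, b4));  // `out` may be unaligned
}

// Usage example: logical element 5 lives in foo[1], floats 4..7.
inline void example() {
    alignas(16) FoobarBlock foo[4] = {};
    const float av[4] = {1.0f, 5.0f, 3.0f, 7.0f};
    const float bv[4] = {2.0f, 4.0f, 6.0f, 0.0f};
    for (int k = 0; k < 4; ++k) {
        foo[1].a[4 + k] = av[k];
        foo[1].b[4 + k] = bv[k];
    }
    float out[4];
    max_ab(foo, 5, out);
    assert(out[0] == 2.0f && out[1] == 5.0f && out[2] == 6.0f && out[3] == 7.0f);
}
```

The array itself still has to be placed on a 16-byte boundary (alignas(16) here); the layout only guarantees that alignment is preserved from one block to the next.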


If you want utility functions which act on SSE registers then don't use explicit or implicit load/stores in the utility function. Pass const references to __m128 and return __m128 if you need to.

//SSE utility function
static inline __m128 mulk_SSE(__m128 const &a, float k)
{
    return _mm_mul_ps(_mm_set1_ps(k),a);
}

//main function
void foo(float *x, float *y, int n)
{
    for(int i=0; i<n; i+=4) {
        __m128 t1 = _mm_loadu_ps(&x[i]);
        __m128 t2 = mulk_SSE(t1,3.14159f);
        // unaligned store: y is not assumed to be 16-byte aligned
        _mm_storeu_ps(&y[i], t2);
    }
}

The reason to use a const reference is that MSVC cannot pass __m128 by value. Without a const reference you get an error

error C2719: formal parameter with __declspec(align('16')) won't be aligned.

__m128 for MSVC is really a union anyway.

typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {
     float               m128_f32[4];
     unsigned __int64    m128_u64[2];
     __int8              m128_i8[16];
     __int16             m128_i16[8];
     __int32             m128_i32[4];
     __int64             m128_i64[2];
     unsigned __int8     m128_u8[16];
     unsigned __int16    m128_u16[8];
     unsigned __int32    m128_u32[4];
 } __m128;

Presumably MSVC should not have to load the union when the SSE utility functions are inlined.


Based on the OP's latest code update, here is what I would suggest

#include <x86intrin.h>
struct Vector4 {
    __m128 data;
    Vector4() {
    }
    Vector4(__m128 const &v) {
        data = v;
    }
    Vector4 & load(float const *x) {
        data = _mm_loadu_ps(x);
        return *this;
    }
    void store(float *x) const {
        _mm_storeu_ps(x, data);
    }
    operator __m128() const {
        return data;
    }
};

static inline Vector4 operator + (Vector4 const & a, Vector4 const & b) {
    return _mm_add_ps(a, b);
}

static inline Vector4 operator - (Vector4 const & a, Vector4 const & b) {
    return _mm_sub_ps(a, b);
}

struct Foobar {
    float a[4];
    float b[4];
    int c;
};

int main(void)
{
    Foobar myArray[10];
    // note that myArray[0].a, myArray[0].b, and myArray[1].a should be
    // initialized before doing the following
    Vector4 a0 = Vector4().load(myArray[0].a);
    Vector4 b0 = Vector4().load(myArray[0].b);
    Vector4 a1 = Vector4().load(myArray[1].a);        
    (a0 + b0 - a1).store(myArray[1].b);
}

This code was based on ideas from Agner Fog's Vector Class Library.

Upvotes: 4

zam

Reputation: 1684

If you use auto-vectorization or explicit pragma-driven vectorization (OpenMP 4, Cilk), you can force the compiler to use unaligned loads for a vectorized loop with:

#pragma vector unaligned //for C/C++ 

CDEC$ vector unaligned ; for Fortran

This is primarily intended to control trade-offs between "aligned but peeled" vs. "not peeled, but unaligned". Read more details at https://software.intel.com/en-us/articles/utilizing-full-vectors

This only works for Intel compilers as far as I know. Intel compilers also have an internal compilation switch, -mP2OPT_vec_alignment=6, that does the same for the whole compilation unit.

I didn't check if it could be effectively applied to implementations where intrinsics/assembly is used together with OpenMP/Cilk.

Upvotes: 0

Paul R

Reputation: 213060

You could try changing your struct to:

#pragma pack(push, 4)
struct Foobar {
    int c;
    __m128 a;
    __m128 b;
};
#pragma pack(pop)

That would have the same size of course, and should in theory force clang to generate unaligned loads/stores.


Alternatively you could use explicit unaligned loads/stores, e.g. change:

v = _mm_max_ps(myArray[300].a, myArray[301].a);

to:

__m128 v1 = _mm_loadu_ps((float *)&myArray[300].a);
__m128 v2 = _mm_loadu_ps((float *)&myArray[301].a);
v = _mm_max_ps(v1, v2);

Upvotes: 0
