Memory alignment issues with GCC Vector Extension and ARM NEON

Question

Problem Description

I'm trying to write NEON-optimized code using the GCC vector extension. I therefore defined a union like this:

#include <arm_neon.h>

typedef int32_t    v4si __attribute__ ((vector_size (16)));
typedef float32_t  v4sf __attribute__ ((vector_size (16)));

union v128
{
    int32x4_t   m128i;
    float32x4_t m128f;
    v4si        si;
    v4sf        sf;
};

v128 x,y;

Writing code like x.sf *= y.sf often leads to crashes due to bus errors. A check with gdb always reveals that, in all of these crashes, at least one variable is aligned only to 8 bytes rather than 16 bytes. However, when I compile with the optimization option "-O2", these crashes occur much more rarely.
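The gdb check described above can also be done in code. A minimal sketch, assuming two hypothetical helpers (`misalignment_16` and `check_vector_alignment`, neither is part of my original code) and plain `float` standing in for `float32_t` so that it builds without arm_neon.h:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the original code: distance in bytes
// from p down to the previous 16-byte boundary (0 means 16-byte aligned).
inline std::size_t misalignment_16( const void* p )
{
    return static_cast<std::size_t>( reinterpret_cast<std::uintptr_t>( p ) % 16u );
}

// Same vector typedef as above, with plain float standing in for float32_t.
typedef float v4sf __attribute__ ((vector_size (16)));

// Hypothetical helper: in builds without NDEBUG, abort with a readable
// message if a vector operand does not sit on a 16-byte boundary.
inline void check_vector_alignment( const v4sf* v )
{
    assert( misalignment_16( v ) == 0 && "vector operand is not 16-byte aligned" );
}
```

Calling `check_vector_alignment( &x.sf )` right before an operation such as `x.sf *= y.sf` should turn the bus error into an assertion failure that names the offending operand. A variable that is only 8-byte aligned would report a misalignment of 8.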

Is there any gcc/g++ compiler option that always guarantees 16-byte alignment for GCC vectors? Since "-O2" enables an entire bundle of optimizations, does anyone know which particular optimization leads to this much lower frequency of bus errors?

I am compiling and testing my code on a Raspberry Pi 3, where I also use these g++ options:

-march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations

Minimal Code Example

simd_numeric_test.cpp:

#include <arm_neon.h>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>


typedef int32_t    v4si __attribute__ ((vector_size (16), aligned(16)));
typedef float32_t  v4sf __attribute__ ((vector_size (16), aligned(16)));


typedef int32x4_t   m128i_t; // __attribute__ ((aligned(16)));
typedef float32x4_t m128f_t; // __attribute__ ((aligned(16)));

union v128
{
    m128i_t m128i;
    m128f_t m128f;
    v4si    si;
    v4sf    sf;
};
static_assert( sizeof(v128) == 16 );


struct vf32_t
{
    v128 val;

    static constexpr size_t num_items() { return (sizeof(val) / sizeof(float32_t)); }

    inline
    const vf32_t& operator+=( const vf32_t& other ) { val.sf += other.val.sf; return *this; }

    inline
    const float32_t* cbegin() const { return &(val.sf[0]); }

    inline
    const float32_t* cend() const { return &(val.sf[num_items()]); }
};
static_assert( sizeof(vf32_t) == 16 );


class CSimdNumericTest
{
protected:

    const size_t m_numElemInSimd     = vf32_t::num_items();
    
    const int m_randomSeed_u         = 69;
    const int m_repeats_u            = 10000;

    const float32_t m_maxFloatVal_f32;// = 43.f;

    std::default_random_engine                m_rand;
    std::uniform_real_distribution<float32_t> m_floatSampler;

    void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const;

public:

    void float32_base_op_test();

    CSimdNumericTest()
        : m_maxFloatVal_f32( std::ceil( std::pow( std::numeric_limits<float32_t>::max(),
                                                  1.f / static_cast<float32_t>( m_numElemInSimd  ) ) ) )
        , m_rand( m_randomSeed_u )
        , m_floatSampler( -m_maxFloatVal_f32, m_maxFloatVal_f32 )
    {}
};

void CSimdNumericTest::test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const
{
    vf32_t x = a_v32;

    x += b_v32;

    auto aIter = a_v32.cbegin();
    auto bIter = b_v32.cbegin();
    for ( auto xIter = x.cbegin(); xIter != x.cend();
           ++xIter, ++aIter, ++bIter ) {
        float32_t rx = *aIter;
        rx += *bIter;
        assert( rx == *xIter );
    }
}

void CSimdNumericTest::float32_base_op_test()
{
    vf32_t a_v32, b_v32;

    const float32_t l_minFloat_f32 = 1. / m_maxFloatVal_f32;

    for ( int n = 0; n < m_repeats_u; ++n )
    {
        for ( size_t i = 0; i < vf32_t::num_items(); ++i )
        {
            a_v32.val.sf[i] = m_floatSampler( m_rand );
            b_v32.val.sf[i] = m_floatSampler( m_rand );
        }
        test_binary_assign_vv_operation( a_v32, b_v32 );
    }
}

int main(int argc, char **argv) {
  
    CSimdNumericTest test;
    test.float32_base_op_test();
    return 0;
}

I compiled everything with

arm-linux-gnueabihf-g++ -c -o simd_numeric_test_neon.o simd_numeric_test.cpp -pipe -fsigned-char -pthread -ftree-vectorize -Wall -Wextra -Wdate-time -Wformat -Werror=format-security -ggdb3 -O0 -march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations -Wno-psabi 
arm-linux-gnueabihf-g++ -pthread -lpthread -lstdc++ -o simd_test_neon simd_numeric_test_neon.o

The compiled results:

simd_numeric_test_neon.o object file
simd_test_neon executable

The crash appears at the assignment statement:

x += b_v32;

Godbolt link

Further Investigation Results

I have now noticed that all the crashes occur with pass-by-value function parameters. While the original vector variable is still correctly aligned, the copied function parameter no longer is. The executable therefore works correctly when I replace pass-by-value with pass-by-reference:

void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 )

to

void test_binary_assign_vv_operation( const vf32_t& a_v32, const vf32_t& b_v32 )

I observed this pattern for all my cases of bus-error-crashes.
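The difference between the two signatures can be probed directly. A minimal sketch, assuming a hypothetical stand-in struct `vec_holder` for `vf32_t`, hypothetical probe functions, and plain `float` instead of `float32_t` so it builds without arm_neon.h:

```cpp
#include <cassert>
#include <cstdint>

// Same vector typedef as in the minimal example, with plain float.
typedef float v4sf __attribute__ ((vector_size (16), aligned(16)));

// Hypothetical stand-in for vf32_t.
struct vec_holder
{
    v4sf sf;
};

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16( const void* p )
{
    return ( reinterpret_cast<std::uintptr_t>( p ) % 16u ) == 0u;
}

// By value: the callee inspects its own copy, whose placement is chosen
// by the compiler and the calling convention.
bool by_value_param_is_aligned( const vec_holder v )
{
    return is_aligned_16( &v );
}

// By reference: the callee inspects the caller's original object.
bool by_ref_param_is_aligned( const vec_holder& v )
{
    return is_aligned_16( &v );
}
```

If the pattern I observed holds, `by_ref_param_is_aligned` should return true for a correctly aligned variable, while `by_value_param_is_aligned` may return false on the Raspberry Pi build at -O0.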

However, this observation does not really provide a solution. Plenty of functions (e.g. in the C++ STL) use pass-by-value.

Is there any g++ option that also ensures correct memory alignment for vector function parameters? Could this be a g++ bug?

Many thanks in advance
