Alf
Alf

Reputation: 27

Memory alignment issues with GCC Vector Extension and ARM NEON

Problem Description

I'm trying to write NEON optimized code using the GCC vector extension. Therefore I defined a union structure like

#include <arm_neon.h>

typedef int32_t    v4si __attribute__ ((vector_size (16)));
typedef float32_t  v4sf __attribute__ ((vector_size (16)));

union v128
{
    int32x4_t   m128i;
    float32x4_t m128f;
    v4si        si;
    v4sf        sf;
};

v128 x,y;

Writing code like x.sf *= y.sf often leads to crashes due to bus errors. A check with gdb always reveals that in all these crash cases at least one variable is only aligned to 8 bytes and not to 16 bytes. However, when I compile with the optimization option "-O2" these crash cases occur much rarer.

Is there any gcc/g++ compiler option which always guarantees a 16 bit alignment for GCC vectors? Since "-O2" enables an entire bundle of optimizations, does anyone know which particular optimization leads to this much lower frequency of bus errors?

I am compiling and testing my code on a raspberry pi 3. There I also use the g++ parameters:

-march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations

Minimal Code Example

simd_numeric_test.cpp:

#include <random>
#include <limits>
#include <cfloat>
#include <type_traits>
#include <cassert>
#include <arm_neon.h>


typedef int32_t    v4si __attribute__ ((vector_size (16), aligned(16)));
typedef float32_t  v4sf __attribute__ ((vector_size (16), aligned(16)));


typedef int32x4_t   m128i_t; // __attribute__ ((aligned(16)));
typedef float32x4_t m128f_t; // __attribute__ ((aligned(16)));

union v128
{
    m128i_t m128i;
    m128f_t m128f;
    v4si    si;
    v4sf    sf;
};
static_assert( sizeof(v128) == 16 );


struct vf32_t
{
    v128 val;

    static constexpr size_t num_items() { return (sizeof(val) / sizeof(float32_t)); }

    inline
    const vf32_t& operator+=( const vf32_t& other ) { val.sf += other.val.sf; return *this; }

    inline
    const float32_t* cbegin() const { return &(val.sf[0]); }

    inline
    const float32_t* cend() const { return &(val.sf[num_items()]); }
};
static_assert( sizeof(vf32_t) == 16 );


class CSimdNumericTest
{
protected:

    const size_t m_numElemInSimd     = vf32_t::num_items();
    
    const int m_randomSeed_u         = 69;
    const int m_repeats_u            = 10000;

    const float32_t m_maxFloatVal_f32;// = 43.f;

    std::default_random_engine                m_rand;
    std::uniform_real_distribution<float32_t> m_floatSampler;

    void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const;

public:

    void float32_base_op_test();

    CSimdNumericTest()
        : m_maxFloatVal_f32( std::ceil( std::pow( std::numeric_limits<float32_t>::max(),
                                                  1.f / static_cast<float32_t>( m_numElemInSimd  ) ) ) )
        , m_rand( m_randomSeed_u )
        , m_floatSampler( -m_maxFloatVal_f32, m_maxFloatVal_f32 )
    {}
};

void CSimdNumericTest::test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const
{
    vf32_t x = a_v32;

    x += b_v32;

    auto aIter = a_v32.cbegin();
    auto bIter = b_v32.cbegin();
    for ( auto xIter = x.cbegin(); xIter != x.cend();
           ++xIter, ++aIter, ++bIter ) {
        float32_t rx = *aIter;
        rx += *bIter;
        assert( rx == *xIter );
    }
}

void CSimdNumericTest::float32_base_op_test()
{
    vf32_t a_v32, b_v32;

    const float32_t l_minFloat_f32 = 1. / m_maxFloatVal_f32;

    for ( int n = 0; n < m_repeats_u; ++n )
    {
        for ( size_t i = 0; i < vf32_t::num_items(); ++i )
        {
            a_v32.val.sf[i] = m_floatSampler( m_rand );
            b_v32.val.sf[i] = m_floatSampler( m_rand );
        }
        test_binary_assign_vv_operation( a_v32, b_v32 );
    }
}

int main(int argc, char **argv) {
  
    CSimdNumericTest test;
    test.float32_base_op_test();
    return 0;
}

I compiled everything with

arm-linux-gnueabihf-g++ -c -o simd_numeric_test_neon.o simd_numeric_test.cpp -pipe -fsigned-char -pthread -ftree-vectorize -Wall -Wextra -Wdate-time -Wformat -Werror=format-security -ggdb3 -O0 -march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations -Wno-psabi 
arm-linux-gnueabihf-g++ -pthread -lpthread -lstdc++ -o simd_test_neon simd_numeric_test_neon.o

The compiled results:

The crash appears at the assignment statement:

x += b_v32;

Godbolt link

further investigation results

Now I noticed that all the crashes occur when using pass-by-value function parameters. While the original vector variable is still correctly aligned, the copied function parameter is not anymore. Therefore the executable works correctly when I replace pass-by-value with pass-by-reference:

void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 )

to

void test_binary_assign_vv_operation( const vf32_t& a_v32, const vf32_t& b_v32 )

I observed this pattern for all my cases of bus-error-crashes.

However this observation does not really bring a solution. There are plenty of functions (e.g. in the C++STL) that use pass-by-value.

Is there any g++ parameter hat enables also a correct memory alignment for vectorized function parameters? Could this be a g++ bug?

Many thanks in advance

Upvotes: 1

Views: 1500

Answers (1)

Nate Eldredge
Nate Eldredge

Reputation: 58741

I agree with you that this is a bug in gcc on ARM / AArch64 and several other targets (but not x86).

The problem seems to arise when you have a type requiring extra alignment, but which can be passed in a register. If you pass such an object as a function argument, and the called function takes its address, the object is spilled to the stack but without the necessary alignment. The unaligned object may then be passed by reference to another function, causing the crash.

It can be reproduced in C and without vectors. Here is a test case; compile with -O0 to avoid inlining. (But the function itself is still miscompiled even with optimization on.)

#include <stdio.h>

typedef int V __attribute__((aligned(64)));

void f3(V *p) {
  printf("%p\n", (void *)p);
}

void f2(V x) {
    //volatile int blah = 17;
    f3(&x);
}

int main(void) {
  f2(-43);
  return 0;
}

With gcc up through 10.2, on both arm-linux-gnueabihf and aarch64-linux-gnu, this prints addresses that are not 64-byte aligned. (You may have to uncomment the volatile int declaration just in case the stack is properly aligned by coincidence.)

Inspecting the generated assembly shows that gcc spills x to the stack and makes no attempt to align it. ABI stack alignment is, I believe, only 8 bytes for ARM, and 16 bytes for AArch64, so manual alignment would be needed.

On ARM:

f2:
        push    {r7, lr}
        sub     sp, sp, #8
        add     r7, sp, #0
        str     r0, [r7]
        mov     r3, r7
        mov     r0, r3
        bl      f3(PLT)
        nop
        adds    r7, r7, #8
        mov     sp, r7
        pop     {r7, pc}

On AArch64:

f2:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        str     w0, [sp, 16]
        add     x0, sp, 16
        bl      f3
        nop
        ldp     x29, x30, [sp], 32
        ret

You can work around the bug in your own functions by assigning the function parameter to a temporary variable and passing that instead, but of course, as you say, that doesn't help with functions generated from standard library templates.

It looks like clang handles the alignment properly, so that may be another option for you.

Update: The bug is present in gcc trunk as of 20201010, and I was also able to reproduce it on alpha, sparc64 and mips targets (in emulation). However, x86-64 generates proper alignment code. I've reported this as gcc bug 97473.

Upvotes: 2

Related Questions