Reputation: 27
I'm trying to write NEON-optimized code using the GCC vector extension. To do so, I defined a union like this:
#include <arm_neon.h>
typedef int32_t v4si __attribute__ ((vector_size (16)));
typedef float32_t v4sf __attribute__ ((vector_size (16)));
union v128
{
int32x4_t m128i;
float32x4_t m128f;
v4si si;
v4sf sf;
};
v128 x,y;
Writing code like x.sf *= y.sf often leads to crashes due to bus errors.
A check with gdb always reveals that in all these crash cases at least one variable is only aligned to 8 bytes and not to 16 bytes.
However, when I compile with the optimization option "-O2", these crashes occur much less often.
Is there any gcc/g++ compiler option that always guarantees 16-byte alignment for GCC vectors? Since "-O2" enables an entire bundle of optimizations, does anyone know which particular optimization leads to this much lower frequency of bus errors?
I am compiling and testing my code on a Raspberry Pi 3, with these additional g++ options:
-march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations
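The alignment check done in gdb can also be performed inside the program, which makes it easier to see which object is misaligned before the bus error occurs. A minimal sketch, assuming two hypothetical helpers, is_aligned and assert_aligned16, that are not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): returns true if p is
// aligned to `alignment` bytes. `alignment` must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical debug check: aborts (unless NDEBUG is defined) if p is not
// 16-byte aligned, as a 128-bit vector operand is expected to be.
inline void assert_aligned16(const void* p) {
    assert(is_aligned(p, 16));
}
```

Calling assert_aligned16(&x) and assert_aligned16(&y) right before the vector operation would turn the bus error into an assertion failure that names the misaligned operand.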
simd_numeric_test.cpp:
#include <random>
#include <limits>
#include <cfloat>
#include <cmath>   // std::ceil, std::pow
#include <type_traits>
#include <cassert>
#include <arm_neon.h>
typedef int32_t v4si __attribute__ ((vector_size (16), aligned(16)));
typedef float32_t v4sf __attribute__ ((vector_size (16), aligned(16)));
typedef int32x4_t m128i_t; // __attribute__ ((aligned(16)));
typedef float32x4_t m128f_t; // __attribute__ ((aligned(16)));
union v128
{
m128i_t m128i;
m128f_t m128f;
v4si si;
v4sf sf;
};
static_assert( sizeof(v128) == 16 );
struct vf32_t
{
v128 val;
static constexpr size_t num_items() { return (sizeof(val) / sizeof(float32_t)); }
inline
const vf32_t& operator+=( const vf32_t& other ) { val.sf += other.val.sf; return *this; }
inline
const float32_t* cbegin() const { return &(val.sf[0]); }
inline
const float32_t* cend() const { return &(val.sf[num_items()]); }
};
static_assert( sizeof(vf32_t) == 16 );
class CSimdNumericTest
{
protected:
const size_t m_numElemInSimd = vf32_t::num_items();
const int m_randomSeed_u = 69;
const int m_repeats_u = 10000;
const float32_t m_maxFloatVal_f32;// = 43.f;
std::default_random_engine m_rand;
std::uniform_real_distribution<float32_t> m_floatSampler;
void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const;
public:
void float32_base_op_test();
CSimdNumericTest()
: m_maxFloatVal_f32( std::ceil( std::pow( std::numeric_limits<float32_t>::max(),
1.f / static_cast<float32_t>( m_numElemInSimd ) ) ) )
, m_rand( m_randomSeed_u )
, m_floatSampler( -m_maxFloatVal_f32, m_maxFloatVal_f32 )
{}
};
void CSimdNumericTest::test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 ) const
{
vf32_t x = a_v32;
x += b_v32;
auto aIter = a_v32.cbegin();
auto bIter = b_v32.cbegin();
for ( auto xIter = x.cbegin(); xIter != x.cend();
++xIter, ++aIter, ++bIter ) {
float32_t rx = *aIter;
rx += *bIter;
assert( rx == *xIter );
}
}
void CSimdNumericTest::float32_base_op_test()
{
vf32_t a_v32, b_v32;
const float32_t l_minFloat_f32 = 1. / m_maxFloatVal_f32;
for ( int n = 0; n < m_repeats_u; ++n )
{
for ( size_t i = 0; i < vf32_t::num_items(); ++i )
{
a_v32.val.sf[i] = m_floatSampler( m_rand );
b_v32.val.sf[i] = m_floatSampler( m_rand );
}
test_binary_assign_vv_operation( a_v32, b_v32 );
}
}
int main(int argc, char **argv) {
CSimdNumericTest test;
test.float32_base_op_test();
return 0;
}
I compiled everything with
arm-linux-gnueabihf-g++ -c -o simd_numeric_test_neon.o simd_numeric_test.cpp -pipe -fsigned-char -pthread -ftree-vectorize -Wall -Wextra -Wdate-time -Wformat -Werror=format-security -ggdb3 -O0 -march=armv8-a+crc -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -funsafe-math-optimizations -Wno-psabi
arm-linux-gnueabihf-g++ -pthread -lpthread -lstdc++ -o simd_test_neon simd_numeric_test_neon.o
The compiled results:
The crash appears at the assignment statement:
x += b_v32;
Now I noticed that all the crashes occur with pass-by-value function parameters. While the original vector variable is still correctly aligned, the copy passed as the function parameter no longer is. The executable therefore works correctly when I replace pass-by-value with pass-by-reference:
void test_binary_assign_vv_operation( const vf32_t a_v32, const vf32_t b_v32 )
to
void test_binary_assign_vv_operation( const vf32_t& a_v32, const vf32_t& b_v32 )
I observed this pattern in all my bus-error crashes.
However, this observation does not really provide a solution. There are plenty of functions (e.g. in the C++ STL) that use pass-by-value.
Is there any g++ option that also enforces correct memory alignment for vector-typed function parameters? Could this be a g++ bug?
Many thanks in advance
Upvotes: 1
Views: 1500
Reputation: 58741
I agree with you that this is a bug in gcc on ARM / AArch64 and several other targets (but not x86).
The problem seems to arise when you have a type requiring extra alignment, but which can be passed in a register. If you pass such an object as a function argument, and the called function takes its address, the object is spilled to the stack but without the necessary alignment. The unaligned object may then be passed by reference to another function, causing the crash.
It can be reproduced in C and without vectors. Here is a test case; compile with -O0 to avoid inlining. (But the function itself is still miscompiled even with optimization on.)
#include <stdio.h>
typedef int V __attribute__((aligned(64)));
void f3(V *p) {
printf("%p\n", (void *)p);
}
void f2(V x) {
//volatile int blah = 17;
f3(&x);
}
int main(void) {
f2(-43);
return 0;
}
With gcc up through 10.2, on both arm-linux-gnueabihf and aarch64-linux-gnu, this prints addresses that are not 64-byte aligned. (You may have to uncomment the volatile int declaration in case the stack happens to be properly aligned by coincidence.)
Inspecting the generated assembly shows that gcc spills x to the stack and makes no attempt to align it. ABI stack alignment is, I believe, only 8 bytes for ARM, and 16 bytes for AArch64, so manual alignment would be needed.
On ARM:
f2:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
str r0, [r7]
mov r3, r7
mov r0, r3
bl f3(PLT)
nop
adds r7, r7, #8
mov sp, r7
pop {r7, pc}
On AArch64:
f2:
stp x29, x30, [sp, -32]!
mov x29, sp
str w0, [sp, 16]
add x0, sp, 16
bl f3
nop
ldp x29, x30, [sp], 32
ret
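For reference, the manual alignment that is missing from both listings boils down to rounding an address up to the next multiple of the required alignment. The following is only a sketch of that arithmetic, using a hypothetical helper align_up; it is not what any compiler actually emits:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only): round addr up to the next
// multiple of `alignment`, which must be a non-zero power of two.
inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment - 1);
    return (addr + mask) & ~mask;
}
```

A spill slot for a 64-byte-aligned object would therefore need up to 63 bytes of extra stack space so that an address satisfying align_up(addr, 64) == addr can be found inside it.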
You can work around the bug in your own functions by assigning the function parameter to a temporary variable and passing that instead, but of course, as you say, that doesn't help with functions generated from standard library templates.
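A minimal sketch of that workaround, reusing the over-aligned type V from the test case. The names f3_checked and f2_workaround are hypothetical, and the sketch assumes (as the workaround itself does) that ordinary local variables receive correctly aligned stack slots:

```cpp
#include <cassert>
#include <cstdint>

typedef int V __attribute__((aligned(64)));  // same over-aligned type as in the test case

// Hypothetical callee that relies on its argument being 64-byte aligned.
inline int f3_checked(const V* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
    return *p;
}

// Workaround sketch: never take the address of the parameter itself.
inline int f2_workaround(V x) {
    V tmp = x;                // copy the parameter into a local variable
    return f3_checked(&tmp);  // pass the address of the copy, not of x
}
```

The cost is one extra copy per call, which is negligible for a single vector register but has to be applied by hand in every affected function.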
It looks like clang handles the alignment properly, so that may be another option for you.
Update: The bug is present in gcc trunk as of 20201010, and I was also able to reproduce it on alpha, sparc64 and mips targets (in emulation). However, x86-64 generates proper alignment code. I've reported this as gcc bug 97473.
Upvotes: 2