Reputation: 977
I was trying out the hello-neon sample provided by android/ndk-samples and tested the FIR filter demo on two devices, one with armeabi-v7a support and the other with the arm64-v8a ABI. By default the JNI code fails for arm64-v8a, but that can be addressed with some tweaks. When I finally run the comparison code on the two devices (with different specs), I get the following results:
armeabi-v7a device (quad core, 32-bit)
C Version: 182.47 ms
Neon Version: 69.782 ms (2.62x faster)

arm64-v8a device (octa core, 64-bit)
C Version: 10.189 ms
Neon Version: 19.4836 ms (0.52295x faster)
Why does this NEON version slow down on arm64-v8a?
(I am fairly new to NEON and SIMD)
Link to intrinsic code - cpp/helloneon-intrinsics.c
Upvotes: 1
Views: 758
Reputation: 21936
Why does this NEON version slow down on arm64-v8a?
Because the code you have linked was written in 2016 for ARMv7.
In ARMv7, the 8-byte NEON registers are individually addressable, so 8-byte vectors are the natural unit there. ARM64 SIMD is 16 bytes all the way down; 8-byte vectors carry a non-trivial performance penalty because only half of each register is doing useful work.
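For contrast, here is a paraphrased sketch of the ARMv7-style approach the 2016 sample takes, with 8-byte vectors and a single accumulator (not the exact sample code; the function name is mine):

#include <arm_neon.h>
#include <stddef.h>

// Inner product for one output position, ARMv7 style: int16x4_t is an
// 8-byte vector, so this consumes 4 shorts per iteration, and the single
// accumulator serializes every vmlal_s16 on the previous one.
static int fir_dot_armv7_style( const short* inputPtr, const short* kernel, size_t kernelSize )
{
    int32x4_t acc = vdupq_n_s32( 0 );
    size_t ii = 0;
    for( ; ii + 4 <= kernelSize; ii += 4 )
    {
        int16x4_t kernel_vec = vld1_s16( kernel + ii );  // 8-byte load
        int16x4_t input_vec = vld1_s16( inputPtr + ii ); // 8-byte load
        acc = vmlal_s16( acc, kernel_vec, input_vec );   // widening multiply-accumulate
    }
    int sum = vgetq_lane_s32( acc, 0 ) + vgetq_lane_s32( acc, 1 )
            + vgetq_lane_s32( acc, 2 ) + vgetq_lane_s32( acc, 3 );
    // Scalar tail for the final 0-3 elements
    for( ; ii < kernelSize; ii++ )
        sum += (int)kernel[ ii ] * (int)inputPtr[ ii ];
    return sum;
}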
Try this version:
#include <arm_neon.h>
#include <stdbool.h>
#include <stddef.h>

void fir_filter_neon_arm64( short *output, const short* input, const short* kernel, size_t width, size_t kernelSize )
{
    const ptrdiff_t offset = -(ptrdiff_t)kernelSize / 2;
    const size_t kernelSizeAligned = ( kernelSize / 8 ) * 8;
    const bool extraVector = ( kernelSize % 8 ) >= 4;
    for( size_t outer = 0; outer < width; outer++ )
    {
        const short* const inputPtr = input + outer + offset;
        // Handle stuff 16 bytes at a time.
        // Using 2 independent accumulators to improve data dependency situation on these accumulators.
        int32x4_t acc1 = vdupq_n_s32( 0 );
        int32x4_t acc2 = vdupq_n_s32( 0 );
        size_t ii = 0;
        for( ; ii < kernelSizeAligned; ii += 8 )
        {
            int16x8_t kernel_vec = vld1q_s16( kernel + ii );
            int16x8_t input_vec = vld1q_s16( inputPtr + ii );
            acc1 = vmlal_s16( acc1, vget_low_s16( kernel_vec ), vget_low_s16( input_vec ) );
            acc2 = vmlal_high_s16( acc2, kernel_vec, input_vec );
        }
        if( extraVector )
        {
            // The remainder was longer than 4, use SIMD for the first 4 of the remaining elements
            int16x4_t kernel_vec = vld1_s16( kernel + ii );
            int16x4_t input_vec = vld1_s16( inputPtr + ii );
            acc1 = vmlal_s16( acc1, kernel_vec, input_vec );
            ii += 4;
        }
        // Add these two accumulators together
        acc1 = vaddq_s32( acc1, acc2 );
        // Horizontal sum into the scalar, ARM64 has an instruction for that
        const int sumVector = vaddvq_s32( acc1 );
        // Handle the final 0-3 elements
        int sumRemainder = 0;
        for( ; ii < kernelSize; ii++ )
            sumRemainder += (int)( kernel[ ii ] ) * (int)( inputPtr[ ii ] );
        // Store the final result
        const int sum = sumVector + sumRemainder;
        output[ outer ] = (short)( ( sum + 0x8000 ) >> 16 );
    }
}
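Side note on that horizontal sum: vaddvq_s32 is the ARM64 across-lanes add the comment refers to. ARMv7 has no single instruction for it; the usual replacement looks like this (helper name is mine):

#include <arm_neon.h>

// Fold the 4 lanes of an int32x4_t into one scalar on ARMv7:
// add the high half to the low half, then pairwise-add the remaining pair.
static inline int hsum_s32_armv7( int32x4_t v )
{
    int32x2_t p = vadd_s32( vget_low_s32( v ), vget_high_s32( v ) );
    p = vpadd_s32( p, p );
    return vget_lane_s32( p, 0 );
}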
If building with GCC, make sure to pass -O3 -fno-tree-vectorize, otherwise the compiler automatically vectorizes that one last loop with the remainder and inflates the code for no good reason. With that command-line switch, the generated code looks reasonable.
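One caveat when calling it: inputPtr starts kernelSize / 2 elements before the current output position, so the caller needs valid data (or zero padding) on both sides of the filtered range. A minimal usage sketch, with made-up sizes:

#include <string.h>

static void example( void )
{
    enum { WIDTH = 1024, KSIZE = 31 };
    // KSIZE / 2 extra elements on each side keep every read in bounds;
    // the zero padding acts as silence outside the filtered range.
    short input[ WIDTH + KSIZE ];
    short kernel[ KSIZE ];
    short output[ WIDTH ];
    memset( input, 0, sizeof( input ) );
    /* ... fill input + KSIZE / 2 through input + KSIZE / 2 + WIDTH - 1, and kernel ... */
    fir_filter_neon_arm64( output, input + KSIZE / 2, kernel, WIDTH, KSIZE );
}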
Upvotes: 5