Minhaz

Reputation: 977

NEON code faster than standard C code on armeabi-v7a but slower on arm64-v8a

I was trying out the hello-neon sample provided in android/ndk-samples and tested the FIR filter demo on two devices, one with armeabi-v7a support and the other with the arm64-v8a ABI.

By default the JNI code fails for arm64-v8a, but that can be addressed with some tweaks. When I finally run the comparison code on the two devices (which have different specs), I get the following results:

armeabi-v7a device - quad core, 32bit

C Version:      182.47 ms
Neon Version:   69.782 ms (2.62x faster)

arm64-v8a device - octa core, 64bit

C Version:      10.189 ms
Neon Version:   19.4836 ms (0.52295x faster)

Question

Why does this neon version slow down for the arm64-v8a?

(I am fairly new to NEON and SIMD)

Link to intrinsic code - cpp/helloneon-intrinsics.c

Upvotes: 1

Views: 758

Answers (1)

Soonts

Reputation: 21936

Why does this neon version slow down for the arm64-v8a?

Because the code you have linked was written in 2016 for ARMv7.

In ARMv7, the 8-byte NEON registers (D registers) are individually addressable.

ARM64 SIMD is 16 bytes wide all the way down; 8-byte vectors carry a non-trivial performance penalty.
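To see the difference in isolation, here is a minimal sketch (not from the linked sample; both function names are made up). On ARMv7 the 64-bit D registers alias halves of the 128-bit Q registers, so 8-byte operations come for free; on ARM64 an int16x4_t still occupies a full 128-bit register and only the low half does useful work.

#include <arm_neon.h>

// 64-bit vectors: 4 int16 lanes per multiply-accumulate step.
int32x4_t mac_8byte( int32x4_t acc, const short* a, const short* b )
{
    return vmlal_s16( acc, vld1_s16( a ), vld1_s16( b ) );
}

// 128-bit vectors: 8 lanes per step. vmlal_high_s16 is ARM64-only and
// consumes the upper halves without a separate vget_high_s16 move.
int32x4_t mac_16byte( int32x4_t acc, const short* a, const short* b )
{
    const int16x8_t va = vld1q_s16( a );
    const int16x8_t vb = vld1q_s16( b );
    acc = vmlal_s16( acc, vget_low_s16( va ), vget_low_s16( vb ) );
    return vmlal_high_s16( acc, va, vb );
}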

Try this version:

#include <arm_neon.h>
#include <stdbool.h>
#include <stddef.h>

void fir_filter_neon_arm64( short *output, const short* input, const short* kernel, size_t width, size_t kernelSize )
{
    const ptrdiff_t offset = -(ptrdiff_t)kernelSize / 2;
    const size_t kernelSizeAligned = ( kernelSize / 8 ) * 8;
    const bool extraVector = ( kernelSize % 8 ) >= 4;

    for( size_t outer = 0; outer < width; outer++ )
    {
        const short* const inputPtr = input + outer + offset;
        // Handle stuff 16 bytes at a time.
        // Using 2 independent accumulators to improve data dependency situation on these accumulators.
        int32x4_t acc1 = vdupq_n_s32( 0 );
        int32x4_t acc2 = vdupq_n_s32( 0 );

        size_t ii = 0;
        for( ; ii < kernelSizeAligned; ii += 8 )
        {
            int16x8_t kernel_vec = vld1q_s16( kernel + ii );
            int16x8_t input_vec = vld1q_s16( inputPtr + ii );
            acc1 = vmlal_s16( acc1, vget_low_s16( kernel_vec ), vget_low_s16( input_vec ) );
            acc2 = vmlal_high_s16( acc2, kernel_vec, input_vec );
        }
        if( extraVector )
        {
            // The remainder was longer than 4, use SIMD for the first 4 of the remaining elements
            int16x4_t kernel_vec = vld1_s16( kernel + ii );
            int16x4_t input_vec = vld1_s16( inputPtr + ii );
            acc1 = vmlal_s16( acc1, kernel_vec, input_vec );
            ii += 4;
        }

        // Add these two accumulators together
        acc1 = vaddq_s32( acc1, acc2 );
        // Horizontal sum into the scalar, ARM64 has an instruction for that
        const int sumVector = vaddvq_s32( acc1 );

        // Handle the final 0-3 elements
        int sumRemainder = 0;
        for( ; ii < kernelSize; ii++ )
            sumRemainder += (int)( kernel[ ii ] ) * (int)( inputPtr[ ii ] );

        // Store the final result
        const int sum = sumVector + sumRemainder;
        output[ outer ] = (short)( ( sum + 0x8000 ) >> 16 );
    }
}

If building with GCC, make sure to pass -O3 -fno-tree-vectorize; otherwise the compiler automatically vectorizes that one last remainder loop, inflating the code for no good reason. With that command-line switch, the generated code looks reasonable.
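For completeness, here is a minimal sketch of how the function might be driven (the buffer sizes, kernel values, and main() harness are made up for illustration, and it assumes fir_filter_neon_arm64 above is in the same translation unit). Note that the input pointer must have kernelSize/2 valid elements before it, because the filter is centered on each output index:

#include <stdio.h>

enum { WIDTH = 16, KERNEL = 5, PAD = KERNEL / 2 };

int main( void )
{
    // Pad the input on both sides: the filter reads kernelSize/2
    // elements before and after each output position.
    short input[ PAD + WIDTH + PAD ];
    const short kernel[ KERNEL ] = { 1, 4, 6, 4, 1 };
    short output[ WIDTH ];
    for( int i = 0; i < PAD + WIDTH + PAD; i++ )
        input[ i ] = (short)( i * 1000 );

    fir_filter_neon_arm64( output, input + PAD, kernel, WIDTH, KERNEL );

    for( int i = 0; i < WIDTH; i++ )
        printf( "%d ", output[ i ] );
    printf( "\n" );
    return 0;
}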

Upvotes: 5
