guxiangtao

Reputation: 53

How to use arm neon 8bit multiply add sum into 32 bit vector ?

I am doing 8-bit fixed-point work. I have two arrays, A and B, both in Q7 format, and I want to compute their sum of products (a dot product). Demo code:

int8_t ra1[16], ra2[16], rb[16];
int8x16_t va1, va2, vb;
int16x8_t vsum1, vsum2;   /* vmlal_s8 accumulates into 16-bit lanes */
va1 = vld1q_s8(ra1);
va2 = vld1q_s8(ra2);
vb = vld1q_s8(rb);
vsum1 = vdupq_n_s16(0);
vsum2 = vdupq_n_s16(0);
    for (......)
    {
        vsum1 = vmlal_s8(vsum1, vget_high_s8(va1), vget_high_s8(vb));
        vsum1 = vmlal_s8(vsum1, vget_low_s8(va1), vget_low_s8(vb));
    }

With sum += a * b, the sum is 16-bit and can easily overflow, because a single Q7×Q7 product already needs up to 15 bits (Q15). I also can't shift the Q7×Q7 product right, since I need to keep the full precision. How can I do this with NEON? I want sum to be 32-bit while a and b stay 8-bit. I don't want to widen a and b to 16-bit and use vmlal_s16 because that would be slower; I need a single instruction that does the multiply and the accumulate in one instruction time. The NEON C intrinsics don't seem to have such a function; maybe NEON assembly code can do this. Who can help me? Thanks. Here is the VMLA assembly instruction information; maybe I can use it. Please give some advice, I am not familiar with assembly code.

Upvotes: 2

Views: 2501

Answers (1)

ErmIg

Reputation: 4038

I hope this code example helps you:

inline int32x4_t Correlation(const int8x16_t & a, const int8x16_t & b)
{
    // Widening 8-bit x 8-bit -> 16-bit multiplies for the low and high halves.
    int16x8_t lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    // Pairwise widen the 16-bit products to 32 bits and add,
    // so the accumulator can never overflow.
    return vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi));
}

void CorrelationSum(const int8_t * a, const int8_t * b, size_t size, int32_t * sum)
{
    int32x4_t sums = vdupq_n_s32(0);
    // Note: assumes size is a multiple of 16.
    for (size_t i = 0; i < size; i += 16)
        sums = vaddq_s32(sums, Correlation(vld1q_s8(a + i), vld1q_s8(b + i)));
    // Horizontal sum of the four 32-bit lanes.
    *sum = vgetq_lane_s32(sums, 0) + vgetq_lane_s32(sums, 1) + vgetq_lane_s32(sums, 2) + vgetq_lane_s32(sums, 3);
}

Note: this example is based on the function Simd::Neon::CorrelationSum(). Also, I would recommend using the following Load() function instead of a bare vld1q_s8():

inline int8x16_t Load(const int8_t * p)
{
#ifdef __GNUC__
    __builtin_prefetch(p + 384);
#endif
    return vld1q_s8(p);
}

Using the prefetch improves performance by 15-20%.

Upvotes: 1
