Reputation: 63
I am working on a native Android app that has to run on an ARMv7 device. For various reasons I need to do some heavy computation on vectors (short and/or float), so I implemented some assembly functions using NEON instructions to speed things up. I have gained a 1.5x speedup, which is not bad, but I am wondering if I can improve these functions to go even faster.
So the question is: what changes can I make to improve these functions?
// Add two float vectors.
// The result could be written to src1 instead of dst.
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
        "1:                              \n"
        "vld1.32 {q0}, [%[src1]]!        \n"
        "vld1.32 {q1}, [%[src2]]!        \n"
        "vadd.f32 q0, q0, q1             \n"
        "subs %[count], %[count], #4     \n"
        "vst1.32 {q0}, [%[dst]]!         \n"
        "bgt 1b                          \n"
        // src1, src2 and count are modified by the asm (post-increment
        // addressing and subs), so they must be read-write operands.
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "q0", "q1"
    );
}
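Since the destination is just another pointer operand, the in-place variant mentioned in the comment is simply a matter of passing src1 as dst, e.g.:

    /* accumulate src2 into src1 in place; count must be a multiple of 4 */
    add_float_vector_with_neon3(src1, src1, src2, count);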
// Multiply a float vector by a scalar.
// The result could be written to src1 instead of dst.
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{
    asm volatile (
        "vdup.32 q1, %[scalar]           \n"
        "2:                              \n"
        "vld1.32 {q0}, [%[src1]]!        \n"
        "vmul.f32 q0, q0, q1             \n"
        "subs %[count], %[count], #4     \n"
        "vst1.32 {q0}, [%[dst]]!         \n"
        "bgt 2b                          \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
        : [scalar] "r" (scalar)
        : "memory", "q0", "q1"
    );
}
// Add two short vectors -> no overflow concerns here.
// The result should be written to a destination different from src1 and src2.
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{
    asm volatile (
        "3:                              \n"
        "vld1.16 {q0}, [%[src1]]!        \n"
        "vld1.16 {q1}, [%[src2]]!        \n"
        "vadd.i16 q0, q0, q1             \n"
        "subs %[count], %[count], #8     \n"
        "vst1.16 {q0}, [%[dst]]!         \n"
        "bgt 3b                          \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "q0", "q1"
    );
}
// Multiply a short vector by a float vector and put the result back into a short vector.
// The result should be written to a destination different from src1.
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
    asm volatile (
        "4:                              \n"
        "vld1.16 {d0}, [%[src1]]!        \n"
        "vld1.32 {q1}, [%[src2]]!        \n"
        "vmovl.s16 q0, d0                \n"   // widen shorts to int32
        "vcvt.f32.s32 q0, q0             \n"   // int32 -> float32
        "vmul.f32 q0, q0, q1             \n"
        "vcvt.s32.f32 q0, q0             \n"   // float32 -> int32 (truncates)
        "vmovn.s32 d0, q0                \n"   // narrow back to 16 bits
        "subs %[count], %[count], #4     \n"
        "vst1.16 {d0}, [%[dst]]!         \n"
        "bgt 4b                          \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "q0", "q1"   // q0 covers d0/d1, so no separate d0 clobber is needed
    );
}
Thanks in advance!
Upvotes: 4
Views: 413
Reputation: 63
OK, I compared the code given in the initial post with the new function proposed by Josejulio:
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
        "1:                              \n"
        "vld1.32 {q0,q1}, [%[src1]]!     \n"
        "vld1.32 {q2,q3}, [%[src2]]!     \n"
        "vadd.f32 q0, q0, q2             \n"
        "vadd.f32 q1, q1, q3             \n"
        "vld1.32 {q4,q5}, [%[src1]]!     \n"
        "vld1.32 {q6,q7}, [%[src2]]!     \n"
        "vadd.f32 q4, q4, q6             \n"
        "vadd.f32 q5, q5, q7             \n"
        "subs %[count], %[count], #16    \n"
        "vst1.32 {q0,q1}, [%[dst]]!      \n"
        "vst1.32 {q4,q5}, [%[dst]]!      \n"
        "bgt 1b                          \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7"
    );
}
Whereas the tool (pulsar.webshaker.net/ccc/index.php) shows a big difference in CPU cycles per float, I don't see much difference when I measure the latency:

            median   Q1      Q3      min     max    (microseconds, 1000 measurements)
original:   3564     3206    5126    1761    12144
unrolled:   3567     3080    4877    3018    11683

So I am not sure the unrolling is really that effective...
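A minimal sketch of the kind of timing loop behind these statistics (the buffer size and fill values here are placeholders, not necessarily the ones actually used):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void add_float_vector_with_neon3(float*, float*, float*, int);

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    enum { N = 1 << 20, RUNS = 1000 };          /* placeholder sizes */
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *d = malloc(N * sizeof *d);
    static long us[RUNS];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    for (int r = 0; r < RUNS; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        add_float_vector_with_neon3(d, a, b, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        us[r] = (t1.tv_sec - t0.tv_sec) * 1000000L
              + (t1.tv_nsec - t0.tv_nsec) / 1000L;
    }
    qsort(us, RUNS, sizeof us[0], cmp_long);    /* sorted -> read off the quantiles */
    printf("median %ld, Q1 %ld, Q3 %ld, min %ld, max %ld\n",
           us[RUNS / 2], us[RUNS / 4], us[3 * RUNS / 4], us[0], us[RUNS - 1]);
    free(a); free(b); free(d);
    return 0;
}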
Upvotes: 0
Reputation: 1295
You can try unrolling your loop to process more elements per iteration.
Your code for add_float_vector_with_neon3 takes 10 cycles (because of stalling) per 4 elements, while the version unrolled to 16 elements consumes 21 cycles: http://pulsar.webshaker.net/ccc/sample-34e5f701
There is some overhead because you need to process the remainder (or you can pad your data to a multiple of 16), but if you have lots of data, the overhead should be fairly low compared to the actual sum. A sketch of such a remainder wrapper is shown below.
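Something along these lines, for example (the wrapper name is made up; the scalar tail handles at most 15 leftover elements):

// Hypothetical wrapper: run the unrolled NEON kernel on the largest
// multiple-of-16 prefix, then finish the tail in plain C.
void add_float_vector_any_count(float *dst, float *src1, float *src2, int count)
{
    int body = count & ~15;                /* largest multiple of 16 */
    if (body)
        add_float_vector_with_neon3(dst, src1, src2, body);
    for (int i = body; i < count; i++)     /* at most 15 leftover elements */
        dst[i] = src1[i] + src2[i];
}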
Upvotes: 1
Reputation: 2817
This is an example of how you can code it with NEON intrinsics.
The advantage is that you let the compiler optimize register allocation and instruction scheduling while still constraining the instruction selection.
The downside is that GCC doesn't seem able to fold the pointer arithmetic into the load/store instructions, so additional ALU instructions are issued for it. Or maybe I'm wrong and GCC has a good reason to do it that way.
With GCC and CFLAGS=-std=gnu11 -O3 -fgcse-lm -fgcse-sm -fgcse-las -fgcse-after-reload -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -fPIE -Wall
this code compiles to very decent object code: the loop is unrolled and interleaved to hide the long delay before the result of a load is available. And it's readable, too.
#include <arm_neon.h>

#define ASSUME_ALIGNED_FLOAT_128(ptr) ((float *)__builtin_assume_aligned((ptr), 16))

__attribute__((optimize("unroll-loops")))
void add_float_vector_with_neon3(float *restrict dst,
                                 const float *restrict src1,
                                 const float *restrict src2,
                                 size_t size)
{
    // size_t index to match the parameter type and avoid a signed/unsigned
    // comparison; size is assumed to be a multiple of 4 (no tail handling).
    for (size_t i = 0; i < size; i += 4) {
        float32x4_t inFloat41 = vld1q_f32(ASSUME_ALIGNED_FLOAT_128(src1));
        float32x4_t inFloat42 = vld1q_f32(ASSUME_ALIGNED_FLOAT_128(src2));
        float32x4_t outFloat4 = vaddq_f32(inFloat41, inFloat42);
        vst1q_f32(ASSUME_ALIGNED_FLOAT_128(dst), outFloat4);
        src1 += 4;
        src2 += 4;
        dst  += 4;
    }
}
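A possible call site, assuming posix_memalign to provide the 16-byte alignment that ASSUME_ALIGNED_FLOAT_128 promises to the compiler (the size here is a placeholder and must be a multiple of 4):

#include <stdlib.h>

int example(void)
{
    size_t n = 1024;   /* placeholder; must be a multiple of 4 */
    float *dst, *src1, *src2;
    /* the alignment hint is a promise, so the buffers really must be
       16-byte aligned; posix_memalign guarantees that */
    if (posix_memalign((void **)&dst,  16, n * sizeof(float)) ||
        posix_memalign((void **)&src1, 16, n * sizeof(float)) ||
        posix_memalign((void **)&src2, 16, n * sizeof(float)))
        return -1;
    /* ... fill src1 and src2 ... */
    add_float_vector_with_neon3(dst, src1, src2, n);
    free(dst); free(src1); free(src2);
    return 0;
}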
Upvotes: 0