Reputation: 25
I have a lot of calculations with complex numbers (usually an array of structs, each holding a real part re and an imaginary part im; see below) and want to speed them up with the NEON C intrinsics. It would be great if you could give me an example of how to speed up things like this:
for (n = 0; n < 1024; n++, p++, ptemp++) { // get cir_abs, also find the biggest point (value and location).
    abs_squared = (Uns32)(((Int32)(p->re)) * ((Int32)(p->re))
                        + ((Int32)(p->im)) * ((Int32)(p->im)));
    // ...
}
p is an array of this kind:
typedef struct {
    Int16 re;
    Int16 im;
} Complex;
I have already read through chapter 12 of "ARM C Language Extensions" but still have trouble understanding how to load and store this kind of struct in order to do the calculations on it.
Upvotes: 0
Views: 2356
Reputation: 12273
Use the vld2* intrinsics to split re and im into different registers upon load, and then process them separately, e.g.
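Complex array[16];
// De-interleaving load of the first 8 elements: val[0] receives the re values, val[1] the im values.
const int16x8x2_t vec_complex = vld2q_s16((const int16_t*)array);
const int16x8_t vec_re = vec_complex.val[0];
const int16x8_t vec_im = vec_complex.val[1];
// re*re + im*im per lane; the lanes are 16 bits wide, so each result wraps modulo 2^16.
const int16x8_t vec_abssq = vmlaq_s16(vmulq_s16(vec_re, vec_re), vec_im, vec_im);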
For the above code, clang 3.3 generates:
vld2.16 {d18, d19, d20, d21}, [r0]
vmul.i16 q8, q10, q10
vmla.i16 q8, q9, q9
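Note that vmulq_s16 and vmlaq_s16 keep only the low 16 bits of each product, while the loop in the question widens to 32 bits before multiplying. A minimal sketch of a widening variant follows, using vmull_s16 to produce 32-bit products. The function name abs_squared_s16, the use of int16_t/uint32_t in place of Int16/Uns32, and the scalar fallback for non-NEON targets and leftover elements are assumptions made for illustration, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumed to match the question's struct, with int16_t standing in for Int16. */
typedef struct {
    int16_t re;
    int16_t im;
} Complex;

/* Hypothetical helper: writes re*re + im*im for each of the n elements of p
 * into out as 32-bit values (the sum is taken modulo 2^32). */
void abs_squared_s16(const Complex *p, uint32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        /* De-interleave 8 elements: val[0] = real parts, val[1] = imaginary parts. */
        const int16x8x2_t v = vld2q_s16((const int16_t *)(p + i));
        const int16x4_t re_lo = vget_low_s16(v.val[0]);
        const int16x4_t re_hi = vget_high_s16(v.val[0]);
        const int16x4_t im_lo = vget_low_s16(v.val[1]);
        const int16x4_t im_hi = vget_high_s16(v.val[1]);
        /* Widening multiplies (16 x 16 -> 32 bits), four lanes at a time.
         * Squares are never negative, so viewing them as unsigned is safe,
         * and the unsigned add wraps instead of overflowing. */
        const uint32x4_t lo = vaddq_u32(
            vreinterpretq_u32_s32(vmull_s16(re_lo, re_lo)),
            vreinterpretq_u32_s32(vmull_s16(im_lo, im_lo)));
        const uint32x4_t hi = vaddq_u32(
            vreinterpretq_u32_s32(vmull_s16(re_hi, re_hi)),
            vreinterpretq_u32_s32(vmull_s16(im_hi, im_hi)));
        vst1q_u32(out + i, lo);
        vst1q_u32(out + i + 4, hi);
    }
#endif
    /* Scalar path: leftover elements, or everything when NEON is unavailable. */
    for (; i < n; i++) {
        const int32_t re = p[i].re;
        const int32_t im = p[i].im;
        out[i] = (uint32_t)(re * re) + (uint32_t)(im * im);
    }
}
```

With the results in a plain uint32_t array, the search for the largest value and its index can start out as an ordinary scalar pass over out and be vectorized later if profiling shows it matters.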
Upvotes: 5