Reputation: 62333
I've just started trying to optimise some Android code using NEON. I'm having a few issues, however. The main one is that I really can't work out how to do a quick 16-bit integer to float conversion.
I see it's possible to convert multiple 32-bit ints to floats in one SIMD instruction using vcvt.f32.s32. However, how do I convert a set of 4 S16s to 4 S32s? I assume it has something to do with the VUZP instruction, but I cannot figure out how...
Equally, I see that it's possible to use VCVT.f32.s16 to convert one 16-bit value to a float at a time, but while this is helpful it seems very wasteful not to be able to do it using SIMD.
I've written assembler on many different platforms over the years but I find the ARM documentation completely unfathomable for some reason.
As such any help would be HUGELY appreciated.
Also is there any way to get the throughput and latency figures for the NEON unit?
Thanks in advance!
Upvotes: 2
Views: 3148
Reputation: 6354
Why use vaddl and waste cycles initializing a valuable register with 0?
vmovl.s16 q0, d1
Then convert q0 to float, e.g. with vcvt.f32.s32 q0, q0.
That will do.
My question is :
PS: strange, I find ARM's PDFs to be the best around.
Upvotes: 1
Reputation: 5655
If no other computation is to be done along with the conversion from 16-bit to 32-bit integers, you can use uint32x4_t = vmovl_u16 (uint16x4_t), or int32x4_t = vmovl_s16 (int16x4_t) for signed input.
If a simple addition, multiplication, etc. is performed before the conversion, you can combine the two in a single instruction such as uint32x4_t = vmull_u16 (uint16x4_t, uint16x4_t) or uint32x4_t = vaddl_u16 (uint16x4_t, uint16x4_t) (vmull_s16 and vaddl_s16 are the signed equivalents), thus saving some cycles.
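For the signed case asked about in the question, the widen-then-convert path could look roughly like the sketch below. The function name s16_to_f32 and the scalar fallback are illustrative assumptions, not part of any library; the NEON path is only compiled when the compiler defines __ARM_NEON, and I have not benchmarked it.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: convert n signed 16-bit values to float. */
void s16_to_f32(const int16_t *src, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        int16x4_t   v16 = vld1_s16(src + i);   /* load 4 x s16 into a D register */
        int32x4_t   v32 = vmovl_s16(v16);      /* sign-extend to 4 x s32 (Q register) */
        float32x4_t vf  = vcvtq_f32_s32(v32);  /* 4 x s32 -> 4 x f32 */
        vst1q_f32(dst + i, vf);                /* store 4 floats */
    }
#endif
    /* Scalar tail (and the whole loop on non-NEON targets). */
    for (; i < n; ++i)
        dst[i] = (float)src[i];
}
```

Every 16-bit integer is exactly representable as a float, so the NEON and scalar paths should produce identical results.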
Upvotes: 4
Reputation: 30480
Elaborating a little on my comment: you want to "widen" the four 16-bit integers to four 32-bit integers before converting them to four 32-bit floats. Looking at the instruction set I don't think there are any faster conversion paths, but I could easily be wrong.
The direct method is to use vaddl.s16 with a second operand of four zeros, but unless you're only doing conversion you can often combine the conversion with a previous operation. E.g. if you're multiplying two int16x4 registers you can use vmull.s16 to get 32-bit output directly rather than first multiplying and widening later (provided you're not depending on any truncating behavior).
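As a rough sketch of the "combine with a previous operation" idea using intrinsics: the widening multiply produces 32-bit products that can be converted to float straight away. The name mul_s16_to_f32 is made up for illustration, and the scalar loop is a fallback I added so the code also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = (float)(a[i] * b[i]), computed in 32 bits. */
void mul_s16_to_f32(const int16_t *a, const int16_t *b, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int16x4_t va = vld1_s16(a + i);
        int16x4_t vb = vld1_s16(b + i);
        int32x4_t prod = vmull_s16(va, vb);        /* 4 x (s16 * s16) -> 4 x s32 */
        vst1q_f32(dst + i, vcvtq_f32_s32(prod));   /* 4 x s32 -> 4 x f32 */
    }
#endif
    /* Scalar tail (and the whole loop on non-NEON targets). */
    for (; i < n; ++i)
        dst[i] = (float)((int32_t)a[i] * (int32_t)b[i]);
}
```

Note that the products are not truncated back to 16 bits, which is exactly the behavioral difference mentioned above: 300 * 300 yields 90000 here rather than a wrapped 16-bit value.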
Upvotes: 2