Reputation: 16843
As part of my calculation I end up with two masks stored in two uint32x4_t variables, both produced by VCEQ. For further processing I want to combine them into a single Q or D register. What is the preferred way to do this in ARM NEON? This is what I have now:
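#include <arm_neon.h>

uint16x8_t combineMasks(uint32x4_t mask_lo, uint32x4_t mask_hi)
{
    // Narrow each 32-bit mask lane to 16 bits (keeps the low half of every lane).
    uint16x4_t lo = vmovn_u32(mask_lo);
    uint16x4_t hi = vmovn_u32(mask_hi);
    // Join the two 64-bit halves into one 128-bit register.
    return vcombine_u16(lo, hi);
}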
Is there a better way to do it? In my case I later vand the resulting mask with some values to find the position of a min/max element.
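To make the intended follow-up step concrete, here is a rough sketch of turning a combined 16-bit mask into a lane position. It is not the code from my project: the helper name first_set_lane and the lane_ids table are made up for illustration, and it uses OR-NOT with lane indices followed by a pairwise minimum rather than the vand I mentioned. A scalar fallback is included only so that the snippet also builds on a non-ARM host.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the index of the first lane whose mask is
 * all-ones, or 0xFFFF if no lane is set. Each mask lane is assumed to be
 * either 0x0000 or 0xFFFF, as produced by a NEON compare. */
static uint16_t first_set_lane(const uint16_t mask[8])
{
    static const uint16_t lane_ids[8] = {0, 1, 2, 3, 4, 5, 6, 7};
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t m   = vld1q_u16(mask);
    uint16x8_t ids = vld1q_u16(lane_ids);
    /* ids | ~m: lanes with a clear mask become 0xFFFF and lose the minimum. */
    uint16x8_t cand = vornq_u16(ids, m);
    /* Three pairwise-min steps reduce 8 lanes to the overall minimum in lane 0. */
    uint16x4_t r = vpmin_u16(vget_low_u16(cand), vget_high_u16(cand));
    r = vpmin_u16(r, r);
    r = vpmin_u16(r, r);
    return vget_lane_u16(r, 0);
#else
    /* Scalar reference with the same semantics, for non-NEON hosts. */
    uint16_t best = 0xFFFF;
    for (int i = 0; i < 8; ++i) {
        uint16_t cand = (uint16_t)(lane_ids[i] | (uint16_t)~mask[i]);
        if (cand < best)
            best = cand;
    }
    return best;
#endif
}
```

The mask passed in would be the result of combineMasks stored to memory, or the register itself if the helper is rewritten to take a uint16x8_t directly.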
Upvotes: 0
Views: 757
Reputation: 6354
// aarch32
vuzp.16 mask_lo, mask_hi // afterwards both registers hold the combined mask; use either one.
// aarch64
uzp1 result.8h, mask_lo.8h, mask_hi.8h
Another example of the uselessness of intrinsux: vuzp1 won't compile if your targets include aarch32. In other words, you have to write both versions in intrinsux anyway if you want maximum performance.
What's the point of intrinsux? It's too much of a headache compared to brutally simple assembly coding.
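For anyone who stays with intrinsics regardless, a minimal sketch of the two-path version might look like the following. The names combine_masks and combine_masks_array are placeholders of mine, not anything from the question. The sketch assumes every 32-bit lane is all-ones or all-zeros, so keeping the even 16-bit halves loses nothing. The scalar branch exists only so the snippet also builds on a non-ARM host.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Combine two 32-bit-lane masks into one 16-bit-lane mask by unzipping. */
static inline uint16x8_t combine_masks(uint32x4_t mask_lo, uint32x4_t mask_hi)
{
    uint16x8_t lo = vreinterpretq_u16_u32(mask_lo);
    uint16x8_t hi = vreinterpretq_u16_u32(mask_hi);
#if defined(__aarch64__)
    /* AArch64 only: maps to a single UZP1. */
    return vuzp1q_u16(lo, hi);
#else
    /* AArch32: VUZP produces both halves; the first one is enough here. */
    uint16x8x2_t r = vuzpq_u16(lo, hi);
    return r.val[0];
#endif
}
#endif

/* Hypothetical array wrapper so the result can be inspected from plain C. */
static void combine_masks_array(const uint32_t lo[4], const uint32_t hi[4],
                                uint16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u16(out, combine_masks(vld1q_u32(lo), vld1q_u32(hi)));
#else
    /* Scalar reference: keep the low 16 bits of every 32-bit lane. */
    for (int i = 0; i < 4; ++i) {
        out[i]     = (uint16_t)lo[i];
        out[i + 4] = (uint16_t)hi[i];
    }
#endif
}
```

Whether the compiler turns either path into exactly the instruction shown above depends on the toolchain, so check the generated assembly before relying on it.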
Upvotes: 2