Efficiently combine masks in ARM NEON

Question

As part of my calculation I end up with two masks stored in two uint32x4_t variables, both produced by VCEQ. For further processing I want to combine them into a single q-reg or d-reg. What's the preferred approach in ARM NEON for doing this?

Simple solution:

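#include <arm_neon.h>

// Narrow each 32-bit mask lane to 16 bits (VMOVN keeps the low half),
// then join the two d-registers into one q-register.
uint16x8_t combineMasks(uint32x4_t mask_lo, uint32x4_t mask_hi)
{
    uint16x4_t lo = vmovn_u32(mask_lo);
    uint16x4_t hi = vmovn_u32(mask_hi);
    return vcombine_u16(lo, hi);
}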

Is there a better way to do it? In my case I later AND (vand) the resulting mask with some values to find the position of a min/max element.

Jake 'Alquimista' LEE · Accepted Answer

// aarch32
vuzp.16     mask_lo, mask_hi        // de-interleaves in place; for all-ones/all-zeros lanes both registers end up identical, so you can use either one.

// aarch64
uzp1        result.8h, mask_lo.8h, mask_hi.8h

Another example of the uselessness of intrinsux: the vuzp1 intrinsics (e.g. vuzp1q_u16) are AArch64-only and won't compile if your targets include aarch32. In other words, you have to write both versions in intrinsux anyway if you want maximum performance.

What's the point of intrinsux? It's too much of a headache compared to the brutally simple assembly coding.
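
For readers who want to stay with intrinsics anyway, here is a minimal sketch of what "writing both versions" might look like. It is not taken from the answer above: the names combine_masks, combine_masks_to_array, demo_combine_masks and the macro COMBINE_MASKS_HAVE_NEON are hypothetical, and the scalar branch is there only to model the lane layout on machines without NEON. It relies on every 32-bit lane being all ones or all zeros (as VCEQ produces), so it does not matter which 16-bit half of a lane is kept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature macro: 1 when the compiler targets NEON. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define COMBINE_MASKS_HAVE_NEON 1
#else
#define COMBINE_MASKS_HAVE_NEON 0
#endif

#if COMBINE_MASKS_HAVE_NEON
/* Keep the even 16-bit halves of both inputs: mask_lo lanes first, then mask_hi. */
static inline uint16x8_t combine_masks(uint32x4_t mask_lo, uint32x4_t mask_hi)
{
    uint16x8_t lo = vreinterpretq_u16_u32(mask_lo);
    uint16x8_t hi = vreinterpretq_u16_u32(mask_hi);
#if defined(__aarch64__)
    return vuzp1q_u16(lo, hi);        /* AArch64-only intrinsic (UZP1) */
#else
    return vuzpq_u16(lo, hi).val[0];  /* ARMv7 VUZP.16: val[0] holds the even halves */
#endif
}
#endif

/* Hypothetical array wrapper: NEON where available, scalar model elsewhere. */
static inline void combine_masks_to_array(const uint32_t lo[4],
                                          const uint32_t hi[4],
                                          uint16_t out[8])
{
#if COMBINE_MASKS_HAVE_NEON
    vst1q_u16(out, combine_masks(vld1q_u32(lo), vld1q_u32(hi)));
#else
    for (int i = 0; i < 4; ++i) {
        out[i]     = (uint16_t)lo[i];   /* low half of each 32-bit lane */
        out[i + 4] = (uint16_t)hi[i];
    }
#endif
}

/* Hypothetical self-check: asserts the expected lane layout, returns 1 on success. */
static inline int demo_combine_masks(void)
{
    const uint32_t lo[4] = { 0xFFFFFFFFu, 0u, 0u, 0xFFFFFFFFu };
    const uint32_t hi[4] = { 0u, 0xFFFFFFFFu, 0u, 0u };
    const uint16_t expected[8] = { 0xFFFF, 0, 0, 0xFFFF, 0, 0xFFFF, 0, 0 };
    uint16_t out[8];

    combine_masks_to_array(lo, hi, out);
    for (int i = 0; i < 8; ++i)
        assert(out[i] == expected[i]);
    return 1;
}
```

Whether the compiler actually emits a single VUZP/UZP1 for this is not guaranteed; checking the generated assembly is the only way to be sure.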
