Pavel P
Pavel P

Reputation: 16892

Efficiently compute two dissimilar numbers in arm neon

I have an array of 16 integers and I'd like to find pair of ints from this array that have max dissimilarity between each other. dissimilarity could be computed with this (pseudo) code:

int diss(uint32_t x, uint32_t y)
{   // it could do square for each byte of the number instead.
    return
    abs(((x >> 24) & 0xFF) - ((y >> 24) & 0xFF)) + 
    abs(((x >> 16) & 0xFF) - ((y >> 16) & 0xFF)) + 
    abs(((x >>  8) & 0xFF) - ((y >>  8) & 0xFF)) + 
    abs(((x >>  0) & 0xFF) - ((y >>  0) & 0xFF));
}

void findDissimilar(uint32_t buf[16], uint32_t& x, uint32_t& y)
{
    int maxDiss = 0;
    for (int i=0; i<16; ++i)
    {
        for (int j=0; j<16; ++j)
        {
            int d = diss(buf[i], bud[j]);
            if (d > maxDiss)
            {
                maxDiss = d;
                x = buf[i];
                y = buf[j];
            }
        }
    }
}

On input buf is already in neon registers if that matters. On output I should get two ints (in neon reg perhaps it's better). How can I do that efficiently in arm neon, what approaches should I try? Just to clarify, the point of the question is about optimizing findDissimilar.

Upvotes: 1

Views: 247

Answers (1)

Pavel P
Pavel P

Reputation: 16892

diss is trivial to compute in neon, it could possibly implemented this way (untested code):

uint32x4_t diss_x4(uint32x4_t x4, uint32x4_t y4)
{
    uint8x16_t diff = vabdq_u8(vreinterpretq_u8_u32(x4), vreinterpretq_u8_u32(x4));
    uint16x8_t m0 = vmull_u8(vget_low_u8(diff), vget_low_u8(diff));
    uint16x8_t m1 = vmull_u8(vget_high_u8(diff), vget_high_u8(diff));
    uint16x4_t s0 = vpadd_u16(vget_low_u8(m0), vget_high_u8(m0));
    uint16x4_t s1 = vpadd_u16(vget_low_u8(m1), vget_high_u8(m1));
    uint16x4_t sx = vpadd_u16(s0, s1);
    return vmovl_u16(sx);
}

but not so trivial to findDissimilar. I think the best approach is to do the following: - load all 16 ints in 4 q-registers {q0, q1, q2, q3}. - first q0 reg contains { buf[0], buf[1], buf[2], buf[3] } - then I can loop 15 times and extract from the 4 input q regs into a qext values such as vextq_u32(q0, q1, 1) for the first iteration and so on. - compute minimums between q0 and qext.

then the same process should be done for q1, q2, q3.

Perhaps if I deinterleave {q0, q1, q2, q3} by byte it could be optimized even better.

Upvotes: 0

Related Questions