uint8 to float using SIMD Neon intrinsics

Question

I'm trying to optimize code that converts grayscale (uint8) images to float images. It runs on Neon (A64/ARMv8).

The current implementation uses OpenCV's convertTo() (compiled for Android) and is quite fast, but it is still our bottleneck.

So I came up with the following code and would like to hear about possible improvements.

The image height and width are multiples of 16, if that helps.

I call the following function in a for loop, 8 pixels per call:

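#include <arm_neon.h>

// Converts 8 consecutive uint8 pixels at `in` to 8 floats at `out`.
static void u8_2_f(const unsigned char* in, float* out)
{
    //1 u8x8->u16x8
    uint8x8_t u8x8src = vld1_u8(in);
    uint16x8_t u16x8src = vmovl_u8(u8x8src);

    //2 u16x8 -> u32x4low, u32x4high
    uint32x4_t u32x4srcl = vmovl_u16(vget_low_u16(u16x8src));
    uint32x4_t u32x4srch = vmovl_u16(vget_high_u16(u16x8src));

    //3 u32x4low, u32x4high -> f32x4low, f32x4high
    // The low half holds in[0..3], so it must go to out[0..3];
    // storing the high half first would swap the two groups of four pixels.
    vst1q_f32(out, vcvtq_f32_u32(u32x4srcl));
    vst1q_f32(out+4, vcvtq_f32_u32(u32x4srch));
}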
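
For reference, here is a minimal sketch of how the surrounding loop could look when it handles 16 pixels per iteration, which fits the multiple-of-16 dimensions mentioned above. The name u8_to_f32_buffer and its parameter n (total pixel count) are hypothetical and not part of my actual code. It also assumes the image rows are contiguous with no padding. The scalar fallback is only there so the snippet builds on non-Neon targets. I have not measured whether this is faster than the 8-pixel version.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: converts n uint8 pixels to float.
   Assumption: n is a multiple of 16 and the buffers do not overlap. */
static void u8_to_f32_buffer(const uint8_t *in, float *out, size_t n)
{
#if defined(__ARM_NEON) || defined(__aarch64__)
    for (size_t i = 0; i < n; i += 16) {
        /* Load 16 bytes, then widen u8 -> u16 in two halves. */
        uint8x16_t u8 = vld1q_u8(in + i);
        uint16x8_t lo16 = vmovl_u8(vget_low_u8(u8));
        uint16x8_t hi16 = vmovl_u8(vget_high_u8(u8));

        /* Widen u16 -> u32, convert to f32, store in source order. */
        vst1q_f32(out + i,      vcvtq_f32_u32(vmovl_u16(vget_low_u16(lo16))));
        vst1q_f32(out + i + 4,  vcvtq_f32_u32(vmovl_u16(vget_high_u16(lo16))));
        vst1q_f32(out + i + 8,  vcvtq_f32_u32(vmovl_u16(vget_low_u16(hi16))));
        vst1q_f32(out + i + 12, vcvtq_f32_u32(vmovl_u16(vget_high_u16(hi16))));
    }
#else
    /* Scalar fallback for targets without Neon. */
    for (size_t i = 0; i < n; ++i) {
        out[i] = (float)in[i];
    }
#endif
}
```

For a contiguous width x height image this would be called once as u8_to_f32_buffer(src, dst, width * height).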
