TL;DR
For ARM NEON intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2 in each dimension. For each 2x2 input block, I want to take the minimum pixel. In plain C, the code looks like this:
#define MIN(a, b) ((a) < (b) ? (a) : (b))

for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
        /* minimum of the 2x2 block: two pixels from this row, two from the row below */
        *p_out = MIN(MIN(p_in[0], p_in[1]), MIN(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
Both rows and cols are multiples of 2. I call "stride" the step in bytes it takes to go from one pixel to the pixel immediately below it in the image.
Now I want to vectorize this. The idea is:

1. load 16 bytes in a from the top row
2. load the 16 bytes immediately below in b
3. compute the minimum between a and b. Store it in a.
4. create a copy of a shifted right by 1 byte (8 bits). Store it in b.
5. compute the minimum between a and b. Store it in a.
6. store every other byte of a in the output image (this discards half of the bytes)

I want to write this using NEON intrinsics. The good news is that for each step there exists an intrinsic that matches it.
For example, at point 3 one can use (from the ARM NEON intrinsics reference):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
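For instance, points 1 to 3 map directly onto vld1q_u8 and vminq_u8 (a minimal sketch, reusing p_in and inStride from the scalar version):

    uint8x16_t a = vld1q_u8(p_in);             /* point 1: 16 bytes from the top row    */
    uint8x16_t b = vld1q_u8(p_in + inStride);  /* point 2: the 16 bytes immediately below */
    a = vminq_u8(a, b);                        /* point 3: byte-wise vertical minimum   */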
And at point 4 one can use one of the following, with a shift of 8 bits (from the ARM NEON intrinsics reference):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13 and 15: they will be discarded from the final result anyway. (The correctness of this has been verified and it is not the point of the question.)
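To make the byte layout concrete, here is a small scalar illustration (mine, not from the original post) of why the useful byte lands in an even position after the 16-bit shift, assuming the usual little-endian lane layout:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Two adjacent pixels p0, p1 occupy one 16-bit lane as p0 | (p1 << 8).
         * Shifting that lane right by 8 bits moves p1 into the even (low) byte;
         * the odd (high) byte becomes 0, one of the "don't care" positions. */
        uint8_t p0 = 30, p1 = 20;
        uint16_t lane    = (uint16_t)(p0 | (p1 << 8));
        uint16_t shifted = (uint16_t)(lane >> 8);
        printf("even byte: %u (== p1)\n", shifted & 0xFF);   /* prints 20 */
        printf("odd  byte: %u (discarded)\n", shifted >> 8); /* prints 0  */
        return 0;
    }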
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, but I have been told that the problem cannot be reliably addressed using a union (Edit: although that answer refers to C++, and in fact in C type punning IS allowed), nor by casting pointers, because that would break the strict aliasing rule.
What is the proper way to combine different data types when using ARM NEON intrinsics?
Upvotes: 3
Views: 1105
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.
So, assuming a and b are declared as:

uint8x16_t a, b;
Your point 4 can be written as (*):

b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8));
However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
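Putting the pieces together, here is a minimal sketch of the full inner loop (my own, not from the original posts). It assumes the same inBuffer/outBuffer/stride names as the scalar version, cols a multiple of 16, and uses the plain vshrq_n_u16 rather than the rounding shift; the vreinterpretq casts are only type annotations and generate no instructions.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch: 2x2 minimum downscale, 16 input columns (8 output pixels) per iteration.
     * Assumes cols is a multiple of 16 and rows a multiple of 2. */
    void downscale_min_2x2(const uint8_t* inBuffer, int inStride,
                           uint8_t* outBuffer, int outStride,
                           int rows, int cols)
    {
        for (int y = 0; y < rows; y += 2) {
            const uint8_t* p_in  = inBuffer  + y * inStride;
            uint8_t*       p_out = outBuffer + (y / 2) * outStride;
            for (int x = 0; x < cols; x += 16) {
                uint8x16_t a = vld1q_u8(p_in);               /* points 1-2: load both rows  */
                uint8x16_t b = vld1q_u8(p_in + inStride);
                a = vminq_u8(a, b);                          /* point 3: vertical minimum   */
                b = vreinterpretq_u8_u16(                    /* point 4: shift right 8 bits */
                        vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
                a = vminq_u8(a, b);                          /* point 5: horizontal minimum */
                /* point 6: keep the low byte of each 16-bit lane (every other byte) */
                vst1_u8(p_out, vmovn_u16(vreinterpretq_u16_u8(a)));
                p_in  += 16;
                p_out += 8;
            }
        }
    }

Here vmovn_u16 is one possible way to implement point 6: it narrows each 16-bit lane to its low byte, i.e. keeps every even-indexed byte of the vector.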
(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);
Upvotes: 2