Reputation: 2499
Still learning the art of SIMD, I have a question: I have two packed 8-bit registers that I'd like to multiply-add with _mm_maddubs_epi16 (pmaddubsw) to obtain a packed 16-bit register.
I know that these bytes will always produce a number less than 256, so I'd like to avoid wasting the remaining 8 bits. For instance, the result of _mm_maddubs_epi16(v1, v2) should be written into r where the XX bytes are, not where it actually ends up (denoted with __).
v1 (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
v2 (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
r (__, XX, __, XX, __, XX, __, XX, __, XX, __, XX, __, XX, __, XX)
Can I do this without shifting the result?
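For reference, a minimal sketch of what I'm doing now (the wrapper function is just for illustration):

    #include <immintrin.h>

    /* Each 16-bit lane of pmaddubsw is hi1*hi2 + lo1*lo2; with the zero
       bytes interleaved as in the diagram above, that is a single
       8-bit x 8-bit product per word, landing where the __ bytes are. */
    __m128i mul_bytes(__m128i v1, __m128i v2) {
        return _mm_maddubs_epi16(v1, v2);
    }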
PS: I don't have a nice processor; I'm limited to AVX instructions.
Upvotes: 1
Views: 455
Reputation: 364180
In your vector diagram, is the highest element at the left or the right? Are the XX locations in the most or least significant byte of the pmaddubsw result?
Use _mm_mulhi_epu16 so you're effectively doing (v1 << 8) * (v2 << 8) >> 16, producing the result in the opposite byte from the input words. Since you say the product is strictly less than 256, you'll get an 8-bit result in the low byte of each 16-bit word.
(If your inputs are signed, use _mm_mulhi_epi16, but then a negative result would be sign-extended to the full 16 bits.)
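A minimal sketch of that case (assuming both inputs already hold their value in the high byte of each 16-bit word, i.e. they're effectively v1 << 8 and v2 << 8; the function name is illustrative):

    #include <immintrin.h>

    /* mulhi returns the high 16 bits of each 32-bit product:
       ((v1 << 8) * (v2 << 8)) >> 16 == v1 * v2, which lands in the
       low byte of each word because the product is < 256. */
    __m128i mul_into_low_byte(__m128i v1_hi, __m128i v2_hi) {
        return _mm_mulhi_epu16(v1_hi, v2_hi);
    }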
You'll need to change how you load / create one of the inputs so instead of
MSB LSB | MSB LSB
v1_lo (00, 04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01)
element# 15 14 13 12 ... 0
you have this (both using Intel's notation where the left element is the highest-numbered, so vector shifts like _mm_slli_si128 shift bytes to the left in the diagram):
MSB LSB | MSB LSB
v1_hi (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
element# 15 14 13 12 ... 0
With v2 still having its non-zero bytes in the low half of each word element, simply _mm_mullo_epi16(v1_hi, v2) and you'll get (v1 * v2) << 8 for free.
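As a sketch (assuming v1_hi has its value in the high byte of each word and v2 has its value in the low byte, as described above):

    #include <immintrin.h>

    /* The low 16 bits of (v1 << 8) * v2 are (v1 * v2) << 8, which fits
       the word exactly because v1 * v2 < 256. */
    __m128i mul_into_high_byte(__m128i v1_hi, __m128i v2) {
        return _mm_mullo_epi16(v1_hi, v2);
    }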
If you're already unpacking bytes with zeros to obtain v1 and v2, then unpack the other way. If you were using pmovzx (_mm_cvtepu8_epi16), then switch to using _mm_unpacklo_epi8(_mm_setzero_si128(), packed_v1).
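A sketch of that unpack (with packed_v1 being the un-widened byte vector, as in the text):

    #include <immintrin.h>

    /* Interleaving zeros as the *first* operand puts each source byte in
       the high byte of its word: bytes come out as 0, p0, 0, p1, ...
       instead of p0, 0, p1, 0, ... */
    __m128i widen_into_high_byte(__m128i packed_v1) {
        return _mm_unpacklo_epi8(_mm_setzero_si128(), packed_v1);
    }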
If you were loading these vectors from memory in this already-zero-padded form, use an unaligned load offset by 1 byte so the zeros end up in the opposite location.
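For example (a sketch, assuming the buffer is laid out as value/zero byte pairs and that reading one byte past the vector is safe):

    #include <immintrin.h>
    #include <stdint.h>

    /* A load at p pairs the bytes as (value, zero): value in the low byte
       of each word. Loading 1 byte later pairs them as (zero, value):
       value in the high byte. Note the element numbering shifts by one,
       and the load touches one byte past the original 16. */
    __m128i load_values_in_high_byte(const uint8_t *p) {
        return _mm_loadu_si128((const __m128i *)(p + 1));
    }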
If what you really want is to start from input bytes that aren't unpacked with zeros at all, I don't think you can avoid that. Or if you're masking instead of unpacking (to save shuffle-port throughput by using _mm_and_si128 instead), you're probably going to need a shift somewhere. You can shift instead of masking one way, though, using v1_hi = _mm_slli_epi16(v, 8): a left-shift by 8 with word granularity leaves the low byte zeroed.
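A sketch of that mask-one-way / shift-the-other idea on packed bytes (function names are mine):

    #include <immintrin.h>

    /* Keep each word's low byte for one operand... */
    __m128i lo_from_packed(__m128i packed) {
        return _mm_and_si128(packed, _mm_set1_epi16(0x00FF));
    }
    /* ...and for the other operand move the low byte up into the high
       byte; the shift zeroes the low byte, so no AND is needed here. */
    __m128i hi_from_packed(__m128i packed) {
        return _mm_slli_epi16(packed, 8);
    }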
Upvotes: 3
Reputation: 1185
Shift v1 or v2 and then use _mm_mullo_epi16().
Possible XY Problem? My guess is that _mm_unpacklo_epi8() and _mm_packus_epi16() may be useful for you.
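A sketch of the shift-then-multiply idea (variable names are hypothetical):

    #include <immintrin.h>

    __m128i mul_shifted(__m128i v1, __m128i v2) {
        __m128i v1_hi = _mm_slli_epi16(v1, 8); /* value into the high byte */
        return _mm_mullo_epi16(v1_hi, v2);     /* (v1 * v2) << 8 per word */
    }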
Upvotes: 0