Reputation: 2499
Still learning the art of SIMD, I have a question: I have two packed 8-bit registers that I'd like to multiply-add with _mm_maddubs_epi16 (pmaddubsw) to obtain a packed 16-bit register.
I know that these bytes will always produce a number less than 256, so I'd like to avoid wasting the remaining 8 bits. For instance, the result of _mm_maddubs_epi16(v1, v2) should be written into r where the XX bytes are, not where it actually ends up (denoted with __).
v1 (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
v2 (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
r (__, XX, __, XX, __, XX, __, XX, __, XX, __, XX, __, XX, __, XX)
Can I do this without shifting the result?
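For reference, a minimal sketch of what I'm doing now (the wrapper function is just for illustration):

    #include <immintrin.h>

    /* Each 16-bit lane of pmaddubsw is hi1*hi2 + lo1*lo2; with the zero
       bytes interleaved as in the diagram above, that is a single
       8-bit x 8-bit product per word, landing where the __ bytes are. */
    __m128i mul_bytes(__m128i v1, __m128i v2) {
        return _mm_maddubs_epi16(v1, v2);
    }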
PS: I don't have a nice processor; I'm limited to AVX instructions.
Upvotes: 1
Views: 455
Reputation: 364180
In your vector diagram, is the highest element at the left or the right? Are the XX locations in the most or least significant byte of the pmaddubsw result?
Use _mm_mulhi_epu16 so you're effectively doing (v1 << 8) * (v2 << 8) >> 16, producing the result in the opposite byte from the input words. Since you say the product is strictly less than 256, you'll get an 8-bit result in the low byte of each 16-bit word.
(If your inputs are signed, use _mm_mulhi_epi16, but then a negative result would be sign-extended to the full 16 bits.)
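A minimal sketch of that case (assuming both inputs already hold their value in the high byte of each 16-bit word, i.e. they're effectively v1 << 8 and v2 << 8; the function name is illustrative):

    #include <immintrin.h>

    /* mulhi returns the high 16 bits of each 32-bit product:
       ((v1 << 8) * (v2 << 8)) >> 16 == v1 * v2, which lands in the
       low byte of each word because the product is < 256. */
    __m128i mul_into_low_byte(__m128i v1_hi, __m128i v2_hi) {
        return _mm_mulhi_epu16(v1_hi, v2_hi);
    }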
You'll need to change how you load / create one of the inputs so instead of
MSB LSB | MSB LSB
v1_lo (00, 04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01)
element# 15 14 13 12 ... 0
you have this (both using Intel's notation where the left element is the highest-numbered, so vector shifts like _mm_slli_si128 shift bytes to the left in the diagram):
MSB LSB | MSB LSB
v1_hi (04, 00, 0e, 00, 04, 00, 04, 00, 0a, 00, 0f, 00, 05, 00, 01, 00)
element# 15 14 13 12 ... 0
With v2 still having its non-zero bytes in the low half of each word element, simply _mm_mullo_epi16(v1_hi, v2) and you'll get (v1 * v2) << 8 for free.
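As a sketch (assuming v1_hi has its value in the high byte of each word and v2 has its value in the low byte, as described above):

    #include <immintrin.h>

    /* The low 16 bits of (v1 << 8) * v2 are (v1 * v2) << 8, which fits
       the word exactly because v1 * v2 < 256. */
    __m128i mul_into_high_byte(__m128i v1_hi, __m128i v2) {
        return _mm_mullo_epi16(v1_hi, v2);
    }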
If you're already unpacking bytes with zeros to obtain v1 and v2, then unpack the other way. If you were using pmovzx (_mm_cvtepu8_epi16), then switch to using _mm_unpacklo_epi8(_mm_setzero_si128(), packed_v1).
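A sketch of that unpack (with packed_v1 being the un-widened byte vector, as in the text):

    #include <immintrin.h>

    /* Interleaving zeros as the *first* operand puts each source byte in
       the high byte of its word: bytes come out as 0, p0, 0, p1, ...
       instead of p0, 0, p1, 0, ... */
    __m128i widen_into_high_byte(__m128i packed_v1) {
        return _mm_unpacklo_epi8(_mm_setzero_si128(), packed_v1);
    }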
If you were loading these vectors from memory in this already-zero-padded form, use an unaligned load offset by 1 byte so the zeros end up in the opposite location.
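For example (a sketch, assuming the buffer is laid out as value/zero byte pairs and that reading one byte past the vector is safe):

    #include <immintrin.h>
    #include <stdint.h>

    /* A load at p pairs the bytes as (value, zero): value in the low byte
       of each word. Loading 1 byte later pairs them as (zero, value):
       value in the high byte. Note the element numbering shifts by one,
       and the load touches one byte past the original 16. */
    __m128i load_values_in_high_byte(const uint8_t *p) {
        return _mm_loadu_si128((const __m128i *)(p + 1));
    }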
If what you really want is to start from input bytes that aren't unpacked with zeros at all, I don't think you can avoid that. Or if you're masking instead of unpacking (to save shuffle-port throughput by using _mm_and_si128 instead), you're probably going to need a shift somewhere. You can shift instead of masking one way, though, using v1_hi = _mm_slli_epi16(v, 8): a left-shift by 8 with word granularity leaves the low byte zeroed.
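A sketch of that mask-one-way / shift-the-other idea on packed bytes (function names are mine):

    #include <immintrin.h>

    /* Keep each word's low byte for one operand... */
    __m128i lo_from_packed(__m128i packed) {
        return _mm_and_si128(packed, _mm_set1_epi16(0x00FF));
    }
    /* ...and for the other operand move the low byte up into the high
       byte; the shift zeroes the low byte, so no AND is needed here. */
    __m128i hi_from_packed(__m128i packed) {
        return _mm_slli_epi16(packed, 8);
    }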
Upvotes: 3
Reputation: 1185
Shift v1 or v2 and then use _mm_mullo_epi16().
Possible XY Problem? My guess is that _mm_unpacklo_epi8() and _mm_packus_epi16() may be useful for you.
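A sketch of the shift-then-multiply idea (variable names are hypothetical):

    #include <immintrin.h>

    __m128i mul_shifted(__m128i v1, __m128i v2) {
        __m128i v1_hi = _mm_slli_epi16(v1, 8); /* value into the high byte */
        return _mm_mullo_epi16(v1_hi, v2);     /* (v1 * v2) << 8 per word */
    }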
Upvotes: 0