Reputation: 81
I am looking for a way to divide packed 16-bit integers under a mask (__mmask16, for example). The _mm512_mask_div_epi32 intrinsic looks suitable, but it only supports packed 32-bit integers, which forces me to widen my packed 16-bit values to 32 bits before using it.
Upvotes: 0
Views: 629
Reputation: 365267
_mm512_mask_div_epi32 isn't a real intrinsic; it's an Intel SVML function. x86 doesn't have SIMD integer division, only SIMD FP division for double and float.
If your divisor vectors are compile-time constants (or reused for multiple dividends), see https://libdivide.com/ for exact division using a multiplicative inverse.
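To illustrate the multiplicative-inverse idea, here is a minimal scalar sketch for unsigned 16-bit values. It is not libdivide's API or its algorithm; the helper names u16_magic and u16_div_magic are made up for this example. It assumes the divisor is nonzero and relies on a 64-bit product, which is enough headroom for the result to be exact for every 16-bit dividend.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: precompute a "magic" multiplier for an unsigned
   16-bit divisor d, so that n / d == (n * magic) >> 32 for every
   n in [0, 65535]. Done once per divisor, then reused. */
static uint64_t u16_magic(uint16_t d)
{
    assert(d != 0);                       /* division by zero is not handled */
    return (UINT64_C(1) << 32) / d + 1;   /* floor(2^32 / d) + 1 */
}

/* Hypothetical helper: divide n by the divisor that produced `magic`,
   using one multiply and one shift instead of a division. */
static uint16_t u16_div_magic(uint16_t n, uint64_t magic)
{
    return (uint16_t)((n * magic) >> 32);
}
```

The multiplier overshoots 2^32 / d by at most d / 2^32, and with n below 2^16 the accumulated error stays too small to push the product past the next integer, so the shift yields the exact quotient. A SIMD version would need the same precomputation per divisor lane.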
Otherwise, your best bet is probably to convert to single-precision FP, which can exactly represent every 16-bit integer. If _mm512_mask_div_epi32 does any extra work to deal with the fact that FP32 can't exactly represent every possible int32_t, that's wasted for your use case.
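The conversion approach could look roughly like the sketch below, for 16 signed 16-bit lanes and a 16-bit mask. The function names are made up for this example, and it assumes GCC or Clang on x86-64 (it uses the target attribute and __builtin_cpu_supports so it builds without -mavx512f and falls back to scalar code on CPUs without AVX-512F). Inactive lanes keep their old value in dst, like a merge-masked intrinsic.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. For each lane i with bit i of k set:
   dst[i] = a[i] / b[i]; other lanes of dst are left unchanged. */
__attribute__((target("avx512f")))
static void mask_div_epi16_avx512(int16_t dst[16], uint16_t k,
                                  const int16_t a[16], const int16_t b[16])
{
    /* Sign-extend 16 x int16 to 16 x int32, then convert to float.
       Every int16 value is exactly representable as a float. */
    __m512 fa = _mm512_cvtepi32_ps(
        _mm512_cvtepi16_epi32(_mm256_loadu_si256((const __m256i *)a)));
    __m512 fb = _mm512_cvtepi32_ps(
        _mm512_cvtepi16_epi32(_mm256_loadu_si256((const __m256i *)b)));

    /* Zero-masked divide: inactive lanes produce 0.0f. */
    __m512 fq = _mm512_maskz_div_ps((__mmask16)k, fa, fb);

    /* Truncate toward zero, matching C integer division. */
    __m512i q32 = _mm512_cvttps_epi32(fq);

    /* Narrow back to 16 bits, keeping the old dst value in inactive lanes.
       An active lane with b[i] == 0 ends up as 0 here instead of faulting. */
    __m256i old = _mm256_loadu_si256((const __m256i *)dst);
    __m256i q16 = _mm512_mask_cvtepi32_epi16(old, (__mmask16)k, q32);
    _mm256_storeu_si256((__m256i *)dst, q16);
}

/* Scalar fallback with the same lane semantics. */
static void mask_div_epi16_scalar(int16_t dst[16], uint16_t k,
                                  const int16_t a[16], const int16_t b[16])
{
    for (int i = 0; i < 16; ++i)
        if ((k >> i) & 1)
            dst[i] = b[i] ? (int16_t)(a[i] / b[i]) : 0;
}

/* Hypothetical dispatcher: use AVX-512F when the CPU and OS support it. */
static void mask_div_epi16(int16_t dst[16], uint16_t k,
                           const int16_t a[16], const int16_t b[16])
{
    if (__builtin_cpu_supports("avx512f"))
        mask_div_epi16_avx512(dst, k, a, b);
    else
        mask_div_epi16_scalar(dst, k, a, b);
}
```

For a full __m512i of 32 lanes you would run this on each 256-bit half with the corresponding half of a __mmask32. Unsigned inputs would use _mm512_cvtepu16_epi32 for the widening step instead.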
(Some future CPUs may support some kind of 16-bit FP in the IA cores, not just the GPU, but for now the best way to take advantage of the high-throughput hardware div/sqrt SIMD execution unit is via conversion to float. On Skylake, for example, vdivps ymm is a single uop with a throughput of one __m256 per 5 clock cycles, and vdivps zmm is 3 uops with a throughput of one __m512 per 10 clock cycles.)
Upvotes: 2