Dividing packed 16-bit integer with mask using AVX512 or SVML intrinsics

Question

I am looking for a way to divide packed 16-bit integers under a mask (__mmask16, for example). The _mm512_mask_div_epi32 intrinsic looks suitable, but it only supports packed 32-bit integers, which forces me to widen my packed 16-bit values to packed 32-bit before using it.

Peter Cordes · Accepted Answer

_mm512_mask_div_epi32 isn't a real intrinsic; it's an Intel SVML function. x86 doesn't have SIMD integer division, only SIMD FP double and float.

If your divisor vectors are compile-time constants (or reused for multiple dividends), see https://libdivide.com/ for exact division using a multiplicative inverse.
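
To illustrate the idea of a multiplicative inverse, here is a minimal scalar sketch for unsigned 16-bit values. It is not libdivide's API: the names `magic_u16` and `div_u16_by_magic` are hypothetical, and the approach (a precomputed `ceil(2^32 / d)` followed by a multiply and a shift) is one simple scheme that is exact for 16-bit dividends and non-zero 16-bit divisors. Signed division and the vectorized form need more care.

```c
#include <stdint.h>

/* Hypothetical helper: precompute m = ceil(2^32 / d) for a non-zero
 * 16-bit divisor. m can be as large as 2^32 (when d == 1), so it is
 * kept in a 64-bit integer. */
static uint64_t magic_u16(uint16_t d)
{
    return ((UINT64_C(1) << 32) + d - 1) / d;
}

/* Hypothetical helper: x / d computed as (x * m) >> 32.
 * The product is below 2^48, so it cannot overflow 64 bits.
 * The rounding error introduced by m is smaller than 1/d, so the
 * result equals floor(x / d) for every 16-bit x. */
static uint16_t div_u16_by_magic(uint16_t x, uint64_t m)
{
    return (uint16_t)(((uint64_t)x * m) >> 32);
}
```

The magic value is computed once per divisor and then reused for every dividend, which is why this only pays off when divisors are constant or reused.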

Otherwise probably your best bet is to convert to single-precision FP which can exactly represent every 16-bit integer. If _mm512_mask_div_epi32 does any extra work to deal with the fact that FP32 can't exactly represent every possible int32_t, that's wasted for your use case.
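
The conversion route could be sketched as follows. This is an assumption-laden illustration, not a tuned implementation: the function name `masked_div_i16x16` is hypothetical, it handles 16 lanes at a time (one 256-bit vector of int16_t widened to one 512-bit vector), and it assumes every lane selected by the mask has a non-zero divisor. When the compiler is not targeting AVX-512F, a scalar fallback computes the same per-lane result.

```c
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: for each lane i in 0..15,
 *   dst[i] = mask bit i set ? a[i] / b[i] (truncating) : src[i]
 * Assumes b[i] != 0 in every lane selected by the mask. */
static void masked_div_i16x16(int16_t dst[16], const int16_t src[16],
                              uint16_t mask,
                              const int16_t a[16], const int16_t b[16])
{
#if defined(__AVX512F__)
    /* Sign-extend 16 x int16 to 16 x int32. */
    __m512i a32 = _mm512_cvtepi16_epi32(_mm256_loadu_si256((const __m256i *)a));
    __m512i b32 = _mm512_cvtepi16_epi32(_mm256_loadu_si256((const __m256i *)b));
    __m512i s32 = _mm512_cvtepi16_epi32(_mm256_loadu_si256((const __m256i *)src));

    /* Every 16-bit integer is exactly representable as a float. */
    __m512 q = _mm512_div_ps(_mm512_cvtepi32_ps(a32), _mm512_cvtepi32_ps(b32));

    /* Truncate toward zero, matching C integer division. */
    __m512i q32 = _mm512_cvttps_epi32(q);

    /* Keep src in lanes where the mask bit is clear. */
    __m512i r32 = _mm512_mask_mov_epi32(s32, (__mmask16)mask, q32);

    /* Narrow back to 16 x int16 by truncation. */
    _mm256_storeu_si256((__m256i *)dst, _mm512_cvtepi32_epi16(r32));
#else
    /* Scalar fallback with the same per-lane semantics. */
    for (int i = 0; i < 16; ++i) {
        if ((mask >> i) & 1)
            dst[i] = (int16_t)((float)a[i] / (float)b[i]);
        else
            dst[i] = src[i];
    }
#endif
}
```

Truncating the float quotient gives the exact integer result here because a non-integer quotient of two 16-bit values is at least 1/|b| away from the nearest integer, far more than the rounding error of a single-precision divide. The one case that does not fit in int16_t is -32768 / -1.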

(Some future CPUs may support some kind of 16-bit FP in the IA cores, not just the GPU, but for now the best way to take advantage of the high-throughput hardware div/sqrt SIMD execution unit is via conversion to float. On Skylake, for example, throughput is one __m256 per 5 clock cycles for vdivps ymm, which is a single uop, or one __m512 per 10 clock cycles for vdivps zmm, which is 3 uops.)
