Reputation: 883
I have:
- vec1 of 64 booleans, stored as a uint64_t.
- vec2 of 64 bytes (so 512 bits in total), stored as an array: uint8_t*.

My goal is to sum all bytes in vec2 that have their corresponding bit set to 1 in vec1. Or in other words: take the dot product.
I assumed this would be a piece of cake for AVX512 instructions and much faster than just writing a for loop that loops over every bit and byte of the vectors. However, I can't get it to work.
My idea:
00110100 .... // Sample byte of 64-bit vector
00000000 00000000 11111111 11111111 00000000 11111111 00000000 00000000 ..... // Turned into 512-bit
__m512i vec2_m512 = _mm512_loadu_si512(vec2);
__m512i vec1_m512 = /* ??? expand each bit of vec1 into a whole 0x00/0xFF byte, as sketched above -- this is the part I can't figure out */;
__m512i result16 = _mm512_maddubs_epi16(vec1_m512, vec2_m512);
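For reference, the plain loop version I'm trying to beat is roughly this:

#include <stdint.h>

// Scalar version: test each bit of vec1 and conditionally add the matching byte of vec2.
uint64_t dot_scalar(uint64_t vec1, const uint8_t *vec2)
{
    uint64_t sum = 0;
    for (int i = 0; i < 64; i++)
        if ((vec1 >> i) & 1)
            sum += vec2[i];
    return sum;
}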
Upvotes: 2
Views: 48
Reputation: 365457
Multiplying by a boolean is the same as masking, which AVX-512 can do natively.
Then it's just a horizontal sum of bytes, which can be done with the usual trick of vpsadbw against zero, then shuffle and hsum qwords. Since your bytes are unsigned, no extra fixups are needed for vpsadbw.
For the qword hsum there's _mm512_reduce_add_epi64; reduce intrinsics like that just expand to some shuffles and adds, no new hardware support. As my answer shows, it ends with two vector->scalar moves instead of a final shuffle+add, at least for the 32-bit version. That's pretty close to break-even most of the time.

#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__
uint64_t dot64_bools_u8(uint64_t mask, const uint8_t *bytes)
{
    // implicit conversion from uint64_t to __mmask64 is fine; same C type anyway.
    __m512i v = _mm512_maskz_loadu_epi8(mask, bytes);   // bytes whose mask bit is 0 load as 0
    v = _mm512_sad_epu8(v, _mm512_setzero_si512());     // hsum groups of 8 bytes into qwords
    return _mm512_reduce_add_epi64(v);                  // a library function of shuffles and adds
}
If the mask is coming from memory, a kmov load on Intel unfortunately costs the same as a GPR load + kmov k, reg, so it doesn't actually help to avoid bouncing through a GPR. On Zen 4, kmov k, r/m64 is 2 uops for either a register or memory source, so it does help there.
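If you do want to write the memory-source kmov explicitly, there's the _load_mask64 intrinsic. A sketch (my own variant of the function above, not benchmarked; the compiler is still free to pick the GPR route):

#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__
// Variant taking the mask by pointer; _load_mask64 corresponds to kmovq k, m64.
uint64_t dot64_bools_u8_memmask(const uint64_t *maskp, const uint8_t *bytes)
{
    __mmask64 m = _load_mask64((__mmask64 *)maskp);   // cast away const to match the intrinsic's signature
    __m512i v = _mm512_maskz_loadu_epi8(m, bytes);
    v = _mm512_sad_epu8(v, _mm512_setzero_si512());
    return _mm512_reduce_add_epi64(v);
}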
Using the mask with the load is the only option; VPSADBW doesn't support masking (probably since it deals with mixed element sizes). And since we have a byte mask, any later step is too late. If the mask is on the critical path, using it later would allow more instruction-level parallelism, but we don't have a good option for using it later. Unfortunately that also means the compiler can't use a memory source operand for vpsadbw; it has to do a separate vmovdqu8 load.
Using only 256-bit vectors would also be possible, splitting the mask in half. I'm not sure if AVX10.1-256 has 64-bit masks, or if you'd have to shift a GPR and kmov if you want to target that. For normal AVX-512 just avoiding 512-bit vectors, you definitely can still use 64-bit masks, and kshift saves a uop vs. shifting a GPR for another kmov. GCC pessimizes this to shifting a GPR and doing two kmovs, but Clang does it as I intended (Godbolt); see the explicit-kshift sketch after the code below. (Fewer uops, but maybe a longer critical path for the second mask, especially on Intel where kshift has high latency.)
#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__ && __AVX512VL__
uint64_t dot64_bools_u8_256(uint64_t mask, const uint8_t *bytes)
{
    // implicit conversion from uint64_t to __mmask64 is fine; same C type anyway.
    __mmask64 m = mask;
    __m256i vlo = _mm256_maskz_loadu_epi8(m, bytes);          // low 32 mask bits
    __m256i vhi = _mm256_maskz_loadu_epi8(m>>32, bytes+32);   // high 32 mask bits
    vlo = _mm256_sad_epu8(vlo, _mm256_setzero_si256());
    vhi = _mm256_sad_epu8(vhi, _mm256_setzero_si256());
    __m256i v = _mm256_add_epi32(vlo, vhi); // I was thinking vpaddd could maybe use a shorter VEX prefix than vpaddq, but actually both can use 2-byte VEX

    // manual hsum of the qword counts: narrow to 128-bit, then to scalar
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    lo = _mm_add_epi32(lo, hi);
    lo = _mm_add_epi32(lo, _mm_unpackhi_epi64(lo,lo));
    return (uint32_t)_mm_cvtsi128_si32(lo);
}
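If you'd rather ask for kshift explicitly instead of hoping the compiler sees it in m>>32, there's the _kshiftri_mask64 intrinsic. A hypothetical helper (I haven't checked whether spelling it this way actually changes GCC's output):

#include <immintrin.h>

// requires __AVX512BW__
// Hypothetical helper: take the high 32 bits of the 64-bit mask with kshiftrq
// instead of a C-level integer shift; the cast keeps the low 32 bits of the result.
static inline __mmask32 mask_high32(__mmask64 m)
{
    return (__mmask32)_kshiftri_mask64(m, 32);
}

You'd pass mask_high32(m) instead of m>>32 for the vhi load.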
_mm256_mask_reduce_add_epi8 exists, but only for 128 and 256, and it returns char (non-widening). _mm256_reduce_add_epi64 doesn't exist, according to Intel's intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE_ALL,AVX_ALL,AVX_512,Other&text=_mm256_reduce_add&ig_expand=4083); strangely there's only epi8, epi16, and ph (half-precision float). So we just do it manually: Fastest way to do horizontal SSE vector sum (or other reduction)
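A quick sanity check against a scalar loop (my own test harness, not part of the functions above):

#include <stdio.h>
#include <stdint.h>

// Assumes dot64_bools_u8 and dot64_bools_u8_256 from above are in scope.
int main(void)
{
    uint8_t bytes[64];
    for (int i = 0; i < 64; i++)
        bytes[i] = (uint8_t)(3*i + 1);
    uint64_t mask = 0xDEADBEEFCAFEF00Dull;

    uint64_t ref = 0;                       // scalar reference: bit i of the mask gates bytes[i]
    for (int i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            ref += bytes[i];

    printf("scalar : %llu\n", (unsigned long long)ref);
    printf("512-bit: %llu\n", (unsigned long long)dot64_bools_u8(mask, bytes));
    printf("256-bit: %llu\n", (unsigned long long)dot64_bools_u8_256(mask, bytes));
    return 0;
}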
Upvotes: 4