Reputation: 883
I have:
- vec1 of 64 booleans, stored as a uint64_t.
- vec2 of 64 bytes (so 512 bits in total), stored as an array: uint8_t*.

My goal is to sum all bytes in vec2 that have their corresponding bit set to 1 in vec1. Or in other words: take the dot product.
I assumed this would be a piece of cake for AVX512 instructions and much faster than just writing a for loop that loops over every bit and byte of the vectors. However, I can't get it to work.
My idea:
00110100 .... // Sample byte of 64-bit vector
00000000 00000000 11111111 11111111 00000000 11111111 00000000 00000000 ..... // Turned into 512-bit
__m512i vec2_m512 = _mm512_loadu_si512(vec2);
__m512i vec1_m512 = /* ??? expand each bit of vec1 into a whole 0x00/0xFF byte, as sketched above -- this is the part I can't figure out */;
__m512i result16 = _mm512_maddubs_epi16(vec1_m512, vec2_m512);
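For reference, the plain loop version I'm trying to beat is roughly this:

#include <stdint.h>

// Scalar version: test each bit of vec1 and conditionally add the matching byte of vec2.
uint64_t dot_scalar(uint64_t vec1, const uint8_t *vec2)
{
    uint64_t sum = 0;
    for (int i = 0; i < 64; i++)
        if ((vec1 >> i) & 1)
            sum += vec2[i];
    return sum;
}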
Upvotes: 2
Views: 48
Reputation: 365457
Multiplying by a boolean is the same as masking, which AVX-512 can do natively.
Then it's just a horizontal sum of bytes, which can be done with the usual trick of vpsadbw against zero, then shuffle and hsum qwords. Since your bytes are unsigned, no extra fixups are needed for vpsadbw.
For the qword hsum there's _mm512_reduce_add_epi64; reduce intrinsics like that just expand to some shuffles and adds, no new hardware support. As my answer shows, it ends with two vector->scalar moves instead of a final shuffle+add, at least for the 32-bit version. That's pretty close to break-even most of the time.

#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__
uint64_t dot64_bools_u8(uint64_t mask, const uint8_t *bytes)
{
    // implicit conversion from uint64_t to __mmask64 is fine; same C type anyway.
    __m512i v = _mm512_maskz_loadu_epi8(mask, bytes);   // bytes whose mask bit is 0 load as 0
    v = _mm512_sad_epu8(v, _mm512_setzero_si512());     // hsum groups of 8 bytes into qwords
    return _mm512_reduce_add_epi64(v);                  // a library function of shuffles and adds
}
If the mask is coming from memory, a kmov load on Intel unfortunately costs the same as a GPR load + kmov k, reg, so it doesn't actually help to avoid bouncing through a GPR. On Zen 4, kmov k, r/m64 is 2 uops for either a register or memory source, so it does help there.
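If you do want to write the memory-source kmov explicitly, there's the _load_mask64 intrinsic. A sketch (my own variant of the function above, not benchmarked; the compiler is still free to pick the GPR route):

#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__
// Variant taking the mask by pointer; _load_mask64 corresponds to kmovq k, m64.
uint64_t dot64_bools_u8_memmask(const uint64_t *maskp, const uint8_t *bytes)
{
    __mmask64 m = _load_mask64((__mmask64 *)maskp);   // cast away const to match the intrinsic's signature
    __m512i v = _mm512_maskz_loadu_epi8(m, bytes);
    v = _mm512_sad_epu8(v, _mm512_setzero_si512());
    return _mm512_reduce_add_epi64(v);
}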
Using the mask with the load is the only option; VPSADBW doesn't support masking (probably since it deals with mixed element sizes). And since we have a byte mask, any later step is too late. If the mask is on the critical path, using it later would allow more instruction-level parallelism, but we don't have a good option for using it later. Unfortunately that also means the compiler can't use a memory source operand for vpsadbw; it has to do a separate vmovdqu8 load.
Using only 256-bit vectors would also be possible, splitting the mask in half. I'm not sure if AVX10.1-256 has 64-bit masks, or if you'd have to shift a GPR and kmov if you want to target that. For normal AVX-512 just avoiding 512-bit vectors, you definitely can still use 64-bit masks, and kshift saves a uop vs. shifting a GPR for another kmov. GCC pessimizes this to shifting a GPR and doing two kmovs, but Clang does it as I intended (Godbolt); see the explicit-kshift sketch after the code below. (Fewer uops, but maybe a longer critical path for the second mask, especially on Intel where kshift has high latency.)
#include <immintrin.h>
#include <stdint.h>

// requires __AVX512BW__ && __AVX512VL__
uint64_t dot64_bools_u8_256(uint64_t mask, const uint8_t *bytes)
{
    // implicit conversion from uint64_t to __mmask64 is fine; same C type anyway.
    __mmask64 m = mask;
    __m256i vlo = _mm256_maskz_loadu_epi8(m, bytes);          // low 32 mask bits
    __m256i vhi = _mm256_maskz_loadu_epi8(m>>32, bytes+32);   // high 32 mask bits
    vlo = _mm256_sad_epu8(vlo, _mm256_setzero_si256());
    vhi = _mm256_sad_epu8(vhi, _mm256_setzero_si256());
    __m256i v = _mm256_add_epi32(vlo, vhi); // I was thinking vpaddd could maybe use a shorter VEX prefix than vpaddq, but actually both can use 2-byte VEX

    // manual hsum of the qword counts: narrow to 128-bit, then to scalar
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    lo = _mm_add_epi32(lo, hi);
    lo = _mm_add_epi32(lo, _mm_unpackhi_epi64(lo,lo));
    return (uint32_t)_mm_cvtsi128_si32(lo);
}
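If you'd rather ask for kshift explicitly instead of hoping the compiler sees it in m>>32, there's the _kshiftri_mask64 intrinsic. A hypothetical helper (I haven't checked whether spelling it this way actually changes GCC's output):

#include <immintrin.h>

// requires __AVX512BW__
// Hypothetical helper: take the high 32 bits of the 64-bit mask with kshiftrq
// instead of a C-level integer shift; the cast keeps the low 32 bits of the result.
static inline __mmask32 mask_high32(__mmask64 m)
{
    return (__mmask32)_kshiftri_mask64(m, 32);
}

You'd pass mask_high32(m) instead of m>>32 for the vhi load.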
_mm256_mask_reduce_add_epi8 exists, but only for 128 and 256, and it returns char (non-widening). _mm256_reduce_add_epi64 doesn't exist, according to Intel's intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE_ALL,AVX_ALL,AVX_512,Other&text=_mm256_reduce_add&ig_expand=4083); strangely there's only epi8, epi16, and ph (half-precision float). So we just do it manually: Fastest way to do horizontal SSE vector sum (or other reduction)
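A quick sanity check against a scalar loop (my own test harness, not part of the functions above):

#include <stdio.h>
#include <stdint.h>

// Assumes dot64_bools_u8 and dot64_bools_u8_256 from above are in scope.
int main(void)
{
    uint8_t bytes[64];
    for (int i = 0; i < 64; i++)
        bytes[i] = (uint8_t)(3*i + 1);
    uint64_t mask = 0xDEADBEEFCAFEF00Dull;

    uint64_t ref = 0;                       // scalar reference: bit i of the mask gates bytes[i]
    for (int i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            ref += bytes[i];

    printf("scalar : %llu\n", (unsigned long long)ref);
    printf("512-bit: %llu\n", (unsigned long long)dot64_bools_u8(mask, bytes));
    printf("256-bit: %llu\n", (unsigned long long)dot64_bools_u8_256(mask, bytes));
    return 0;
}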
Upvotes: 4