Reputation: 13
SIMDKR string matching algorithm used _mm256_movemask_epi8
to convert a Vector256 to an int
by extracting the high bit of each byte.
I want to implement this clang algorithm in C#, by using Vector512 instead of 256, but I can't find a method to do it.
There is a Avx2.MoveMask()
,and no Avx512F/BW/VBMI/DQ.MoveMask
.
const __m256i first = _mm256_set1_epi8(needle[0]);
const __m256i last = _mm256_set1_epi8(needle[m - 1]);
const __m256i block_first1 = _mm256_loadu_si256((const __m256i *)(s + i));
const __m256i block_last1 = _mm256_loadu_si256((const __m256i *)(s + i + m - 1));
const __m256i eq_first1 = _mm256_cmpeq_epi8(first, block_first1);
const __m256i eq_last1 = _mm256_cmpeq_epi8(last, block_last1);
const uint32_t mask1 = _mm256_movemask_epi8(_mm256_and_si256(eq_first1, eq_last1));
I use bits operation to replace _mm512_movepi8_mask
with this:
ulong mask = ((ulong)Avx2.MoveMask(buffer.GetUpper()) << 32) | (uint)Avx2.MoveMask(buffer.GetLower());
Is this right? Is this have the best performance?
Upvotes: 1
Views: 138
Reputation: 64913
AVX512 is (also outside of C#) a bit different when it comes to extracting a mask of the upper bits than AVX2, VPMOVMSKB
has no direct 512-bit equivalent. In raw AVX512 you can convert a vector to a mask (the AVX512 concept of a mask) with the VPMOVB2M/VPMOVW2M/VPMOVD2M/VPMOVQ2M family of instructions, and then you can move the mask from a mask register to a general-purpose register with the kmov
-family of instructions.
C# treats masks a bit differently than raw AVX512 does (masks are mostly represented via the Vector512<T>
type as well, you're not normally working with the mask-as-an-integer, I'm not entirely sure yet what the implications of that are for mask-manipulation code), but you can do both of those steps (converting a vector to a mask and moving it from a mask register to a general purpose register) combined with Vector512.ExtractMostSignificantBits.
I tried that under .NET 8 and I got assembly code like this:
vpmovb2m k1,zmm0
kmovq rax,k1
Looks good to me.
Going more into the actual context of a string comparison, in C# you get some comparisons:
Vector512.Equals
which returns a mask as an Vector512<T>
Avx512BW.CompareEqual
(this is for bytes and words, comparisons for other types are in other classes) which also returns a mask as an Vector512<T>
Vector512.EqualsAny
, Vector512.EqualsAll
, which don't return a mask at all, only a boolean (for both of them I got a comparison and kortestq
, if the inputs are Vector512<byte>
, followed by some branch or setcc depending on how the boolean is used)If you want the result of a comparison as a mask in an integer, you can combine eg Vector512.Equals
with Vector512.ExtractMostSignificantBits
. That doesn't result in pointlessly converting a mask to a vector then back to a mask, you get the right thing, I tried it and got this:
vpcmpeqb k1,zmm0,zmmword ptr [rax+50h]
kmovq rax,k1
Upvotes: 1
Reputation: 365507
AVX-512 can't extract high bits from vector elements directly to a general-purpose register, only into an AVX-512 mask register (k0
-k7
). It does this with vpmovb2m
, intrinsic _mm512_movepi8_mask
.
But you don't want this for string compares.
AVX-512 doesn't have _mm512_cmpeq_epi8
either.
AVX-512 compares directly produce an integer mask (in a mask register), like vpcmpub
, e.g. _mm512_cmpeq_epu8_mask
or vptestmb
Which you then test with ktest
or kortestq k0, k1
to set integer FLAGS if you want to branch on them. Or kmov
to integer regs if you to popcount
to count matches or something.
You can do stuff like if (mask)
and let the compiler figure it out, hopefully using kortest same,same
instead of kmov
and legacy test eax,eax
.
There are C intrinsics kortest
, including for its CF result (set if all-ones), not just its ZF result (set if all zeros, in this case meaning no matches). e.g. unsigned char _kortest_mask64_u8 (__mmask64 a, __mmask64 b, unsigned char* all_ones)
produces both outputs, returning a 0 / 1 integer and producing another output by reference.
The AVX-512 equivalent of your C, using 64-byte vectors, is:
const __m256i first = _mm512_set1_epi8(needle[0]);
const __m256i last = _mm512_set1_epi8(needle[m - 1]);
__m512i block_first1 = _mm512_loadu_si512(s + i); // AVX-512 loads intrinsics take void*
__m512i block_last1 = _mm512_loadu_si512(s + i + m - 1);
__mmask64 eq_first1 = _mm512_cmpeq_epi8_mask(first, block_first1);
__mmask64 eq_last1 = _mm512_cmpeq_epi8_mask(last, block_last1);
uint64_t mask1 = eq_first1 & eq_last1;
// KAND, or KTEST if branching on it, or compilers might choose 2x KMOV + AND
But this might be less efficient: misaligned 512-bit load have worse penalties on current microarchitectures than misaligned 256-bit loads. If 32 bytes at the start/end are enough to detect match candidates positions almost all the time anyway, 64-byte vectors might slow you down more than they speed you up.
Sorry I don't know the C# intrinsics for these, but the key point is that AVX-512 compares work differently from previous SIMD ISAs, not producing vectors of all-0 / all-1 elements. These are the C intrinsics you want, hopefully giving you something to search for in the C# docs.
Upvotes: 1