fwefew 4t4tg

Reputation: 59

SIMD: reduce/convert/project/compress/align __m256i to __mmask8

I have a __m256i register containing packed 32-bit integers of zero/one values:

reg : [0 1 1 0 0 1 1 0] // 256-bit register of 8 integers each either 0 or 1

I need a SIMD function that converts reg to __mmask8. In this example, I want an output equal to binary 0b01100110 = 102. Symbolically:

__mmask8 convert(__m256i); // API
102 = convert([0 1 1 0 0 1 1 0])

Upvotes: 0

Views: 78

Answers (2)

Peter Cordes

Reputation: 365517

__mmask8 is an AVX-512 type but you've only tagged this AVX2. Do you just mean an 8-bit integer?

With AVX-512 you'd use vptestmd of a vector against itself to get a bitmask of which elements are zero / non-zero: _mm256_test_epi32_mask(v, v) (intrinsics guide).
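
A minimal sketch of that (assuming AVX-512F + AVX-512VL; the function name is mine):

#include <immintrin.h>

__mmask8 convert(__m256i v) {
    return _mm256_test_epi32_mask(v, v);  // bit i set iff element i is non-zero
}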

If your source data is in memory, not the result of a recent computation, you might consider testing against an all-ones vector constant (_mm256_set1_epi32(-1)) so the instruction can use a memory source operand like vptestmd k1, ymm16, [rdi], which wouldn't be possible if it needed the same vector as both source operands.
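
For example (a sketch; the pointer argument is an assumption for illustration):

#include <immintrin.h>

__mmask8 convert_from_mem(const int *p) {
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    // With a constant second operand, the compiler can fold the load
    // into a memory source operand for vptestmd.
    return _mm256_test_epi32_mask(v, _mm256_set1_epi32(-1));
}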

vpmovd2m exists and takes only one source operand, but like SSE/AVX movmskps it takes the high bit of each element. For your elements the low bit is the one you want, thanks to an inconvenient choice of 0 / 1 as bools rather than a more traditional SIMD mask of 0 / -1.
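
So with a left shift first, that route would look like (a sketch, assuming AVX-512DQ + AVX-512VL for vpmovd2m):

#include <immintrin.h>

__mmask8 convert_dq(__m256i v) {
    __m256i hi = _mm256_slli_epi32(v, 31);  // move bit 0 of each element to bit 31
    return _mm256_movepi32_mask(hi);        // vpmovd2m: take the high bit of each element
}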


With only AVX2, the only instruction for extracting 1 bit per 32-bit element is vmovmskps, which takes the top bit. So you'd need to left-shift, or compare for equal to 1, if you can't make your elements normal SIMD booleans in the first place (all-zero (0) or all-one bits (-1), like you'd get from _mm256_cmpeq_epi32).

_mm256_movemask_ps( _mm256_castsi256_ps( _mm256_slli_epi32(v, 31) ))

In C we have to use _mm256_castsi256_ps to keep the compiler happy when we want to use an FP instruction on the result of an integer instruction. It's free in asm except for potentially an extra 1 cycle of bypass-forwarding latency penalty, including for all future uses of that vector value until the next context-switch.
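
As a complete function, that could look like (a sketch; the int return value holds the 8-bit mask, zero-extended):

#include <immintrin.h>

int convert_avx2(__m256i v) {
    __m256i hi = _mm256_slli_epi32(v, 31);               // bit 0 of each element -> bit 31
    return _mm256_movemask_ps(_mm256_castsi256_ps(hi));  // vmovmskps: 1 bit per element
}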

In this case the potential latency penalty is still cheaper than using _mm256_movemask_epi8( __m256i ) and needing pext (_pext_u32) to take every 4th bit and left-pack.
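
For reference, that alternative would be something like this (a sketch assuming BMI2, and a full 0 / -1 compare mask as input so all 4 bits per dword agree):

#include <immintrin.h>
#include <stdint.h>

uint8_t convert_pext(__m256i mask) {                     // expects 0 / -1 in each 32-bit element
    uint32_t m32 = (uint32_t)_mm256_movemask_epi8(mask); // 4 identical bits per dword
    return (uint8_t)_pext_u32(m32, 0x11111111u);         // keep every 4th bit, left-packed
}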

Unless your use for the bitmask can equally well use a mask that has 4 bits from each 32-bit element, that is. If you need all 4 bits per element to match (instead of just the low or high one, depending on what shift count you choose), you'd need a compare instead, like _mm256_cmpgt_epi32(v, _mm256_setzero_si256()) to find elements that are signed-greater-than 0. An all-zero vector constant is the cheapest for the compiler to materialize, just vpxor same,same, which is literally as cheap as a NOP on many recent CPUs; that's why I chose it instead of cmpeq against _mm256_set1_epi32(1). (cmpeq against 0 will invert while making a vector mask, if you want that.)
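
That compare version would be (a sketch; it works for any "true" value that is signed-positive, including 0 / 1):

#include <immintrin.h>

int convert_cmp(__m256i v) {
    __m256i m = _mm256_cmpgt_epi32(v, _mm256_setzero_si256());  // 0 / -1 per element
    return _mm256_movemask_ps(_mm256_castsi256_ps(m));
}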

Upvotes: 2

fwefew 4t4tg

Reputation: 59

Using Peter's answer above, summarized a bit more concretely:

const __m256i One = _mm256_set1_epi32(1);  // {1,1,1,1,1,1,1,1}, portable instead of raw 64-bit lane initializers
__mmask8 out = _mm256_mask_test_epi32_mask(0xFF, reg, One);  // the 256-bit intrinsic, not the 128-bit _mm_ one

where, keeping with my original example, reg = [0 1 1 0 0 1 1 0] produces out = 102 as desired. (With the all-ones mask 0xFF, the unmasked _mm256_test_epi32_mask(reg, One) is equivalent.)
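
A self-contained check (assuming AVX-512F + AVX-512VL; element 0 of the setr list becomes the least-significant mask bit):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i reg = _mm256_setr_epi32(0, 1, 1, 0, 0, 1, 1, 0);
    __mmask8 out = _mm256_mask_test_epi32_mask(0xFF, reg, _mm256_set1_epi32(1));
    printf("%u\n", (unsigned)out);  // prints 102
}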

Upvotes: 0
