Reputation: 15070
Can you give the list of conditional instructions available in AVX2? So far I've found the following:
_mm256_blendv_* for selection from a and b based on mask c
Is there something like conditional multiply, conditional add, etc.?
Also, for instructions taking an imm8 operand (like _mm256_blend_*), could you explain how to get that imm8 after a vector comparison?
Upvotes: 1
Views: 1447
Reputation: 365267
AVX512 introduces optional zero-masking and merge-masking for almost all instructions.
Before that, to do a conditional add, mask one operand (with vandps, or vandnps for the inverse) before the add (instead of using vblendvps on the result). This is why packed-compare instructions/intrinsics produce all-zero or all-one elements.
0.0 is the additive identity element, so adding it is a no-op. (Except for IEEE signed-zero semantics: -0.0 + +0.0 gives +0.0 under the default rounding mode, so strictly speaking -0.0 is the identity.)
Masking a constant input instead of blending the result avoids making the critical path longer, for something like conditionally adding 1.0.
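As a minimal sketch of the mask-then-add idea, assuming GCC or Clang (the `target` attribute is a compiler extension, and `cond_add8`, `thresh`, and `addend` are hypothetical names used only for illustration):

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] + (a[i] > thresh ? addend : 0.0f)
 * for 8 floats. The target attribute is a GCC/Clang extension that lets
 * this compile without -mavx2; the CPU must still support AVX2. */
__attribute__((target("avx2")))
void cond_add8(const float *a, float thresh, float addend, float *out)
{
    __m256 va   = _mm256_loadu_ps(a);
    /* Compare yields all-ones or all-zero bits per element. */
    __m256 mask = _mm256_cmp_ps(va, _mm256_set1_ps(thresh), _CMP_GT_OQ);
    /* AND the constant with the mask: addend where true, +0.0 where false. */
    __m256 inc  = _mm256_and_ps(mask, _mm256_set1_ps(addend));
    /* Only the add is on the dependency chain through va. */
    _mm256_storeu_ps(out, _mm256_add_ps(va, inc));
}
```

The AND operates on the constant, so it can execute in parallel with whatever produces `va`; a blend on the result would have to wait for the add.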
Conditional multiply is more cumbersome because 0.0 is not the multiplicative identity. You need to multiply by 1.0 to keep a value unchanged, and you can't easily produce that with an AND or ANDN with a compare result. You can blendv an input, or you can do the multiply and blendv the output.
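A sketch of the "blendv an input" variant might look like the following, again assuming GCC or Clang for the `target` attribute; `cond_mul8`, `thresh`, and `scale` are hypothetical names:

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] * (a[i] > thresh ? scale : 1.0f)
 * for 8 floats. Assumes GCC/Clang and a CPU with AVX2. */
__attribute__((target("avx2")))
void cond_mul8(const float *a, float thresh, float scale, float *out)
{
    __m256 va     = _mm256_loadu_ps(a);
    __m256 mask   = _mm256_cmp_ps(va, _mm256_set1_ps(thresh), _CMP_GT_OQ);
    /* blendv picks the second source where the mask's sign bit is set,
     * so the factor is scale where true and 1.0 where false. */
    __m256 factor = _mm256_blendv_ps(_mm256_set1_ps(1.0f),
                                     _mm256_set1_ps(scale), mask);
    _mm256_storeu_ps(out, _mm256_mul_ps(va, factor));
}
```

Blending the multiplier rather than the product keeps the blend off the dependency chain through the data, much like masking a constant for the add case.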
The alternative to blendv is at least 3 boolean operations, like AND/ANDN/OR, but that's usually not worth it. Although note that Haswell runs vblendvps and vpblendvb as 2 uops for port 5, so it's a potential bottleneck compared to using integer booleans that can run on any port. Skylake runs vblendvps as 2 uops for any port. It could make sense to do something to avoid having a blendv on the critical path, though.
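For reference, the AND/ANDN/OR replacement for a variable blend could be sketched like this (GCC/Clang assumed for the `target` attribute; `select8` and its parameters are hypothetical names):

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = (cond[i] > 0) ? b[i] : a[i] for 8 floats,
 * using three boolean ops instead of vblendvps. */
__attribute__((target("avx2")))
void select8(const float *cond, const float *a, const float *b, float *out)
{
    __m256 mask = _mm256_cmp_ps(_mm256_loadu_ps(cond),
                                _mm256_setzero_ps(), _CMP_GT_OQ);
    /* mask & b keeps b where the condition is true. */
    __m256 from_b = _mm256_and_ps(mask, _mm256_loadu_ps(b));
    /* andnot computes (~mask) & a, keeping a where it is false. */
    __m256 from_a = _mm256_andnot_ps(mask, _mm256_loadu_ps(a));
    _mm256_storeu_ps(out, _mm256_or_ps(from_b, from_a));
}
```

This only works when the mask elements are all-ones or all-zero, which is what the compare intrinsics produce; vblendvps only looks at the sign bit of each element.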
Masking an input operand or blending the result is generally how you do branchless SIMD conditionals.
BLENDV is usually at least 2 uops, so it's slower than an AND.
Immediate blends are much more efficient, but you can't use them with a runtime compare result, because the imm8 blend control has to be a compile-time constant embedded into the instruction's machine code. That's what immediate means in an assembly-language context.
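To illustrate what a compile-time blend control looks like, here is a small sketch (GCC/Clang assumed for the `target` attribute; `blend_high_half` is a hypothetical name). The control 0xF0 is a literal, so it works; passing a runtime variable there would not compile to a single blend.

```c
#include <immintrin.h>

/* Hypothetical helper: out = { a[0..3], b[4..7] }.
 * Each set bit j in the immediate selects element j from b. */
__attribute__((target("avx2")))
void blend_high_half(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* 0xF0 must be a constant expression: it is encoded in the instruction. */
    _mm256_storeu_ps(out, _mm256_blend_ps(va, vb, 0xF0));
}
```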
Upvotes: 2
Reputation: 20037
The Intel Intrinsics Guide lists gather, load, and store intrinsics that operate with a mask. The imm8 immediate in blend_epi16 is not programmable at run time unless self-modifying code or a jump table is considered an option. It's still possible to derive the bits: movemask gives 32 independent mask bits in AVX2 (one per byte), and pext from BMI2 can compact the odd-positioned half of them into one bit per 16-bit element. But blend_epi16 uses each imm8 bit to control four bytes, that is, one 16-bit element in each 128-bit lane.
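A rough sketch of the movemask + pext step, assuming GCC or Clang (for the `target` attribute) and a CPU with both AVX2 and BMI2; `cmpgt_epi16_bitmask` is a hypothetical name:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: compare 16 signed 16-bit elements (a[i] > b[i])
 * and return one result bit per element in the low 16 bits. */
__attribute__((target("avx2,bmi2")))
uint32_t cmpgt_epi16_bitmask(const int16_t *a, const int16_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i gt = _mm256_cmpgt_epi16(va, vb);
    /* One bit per byte: each 16-bit element contributes two equal bits. */
    uint32_t byte_bits = (uint32_t)_mm256_movemask_epi8(gt);
    /* Keep only the odd-positioned bits, packed contiguously. */
    return _pext_u32(byte_bits, 0xAAAAAAAAu);
}
```

The low 8 bits describe the low 128-bit lane and the next 8 bits the high lane. Since blend_epi16 applies the same 8 immediate bits to both lanes, this value still can't be fed to it directly; it would only be usable as an index into a jump table or for scalar branching.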
Upvotes: 3