Serge Rogatch

Reputation: 15070

Conditional instructions in AVX2

Can you give the list of conditional instructions available in AVX2? So far I've found the following:

Is there something like conditional multiply, conditional add, etc.?

Also, if instructions taking an imm8 count (like _mm256_blend_*), could you explain how to get that imm8 after a vector comparison?

Upvotes: 1

Views: 1447

Answers (2)

Peter Cordes

Reputation: 365267

AVX512 introduces optional zero-masking and merge-masking for almost all instructions.

Before that, to do a conditional add, mask one operand (with vandps or vandnps for the inverse) before the add (instead of vblendvps on the result). This is why packed-compare instructions/intrinsics produce all-zero or all-one elements.

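0.0 is the additive identity element, so adding it is a no-op. (Except for IEEE signed-zero semantics: in the default rounding mode, -0.0 + +0.0 gives +0.0, so an element that was -0.0 comes out as +0.0. Strictly speaking, -0.0 is the additive identity.)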

Masking a constant input instead of blending the result avoids making the critical path longer, for something like conditionally adding 1.0.
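As a rough sketch of masking an input before the add, here is a conditional accumulate written with intrinsics. The function name `sum_above` and the requirement that `n` be a multiple of 8 are assumptions made for the example, and the `target` attribute is a GCC/Clang extension used in place of compiling with `-mavx2`.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums the elements of data[] that exceed threshold.
   Assumes n is a multiple of 8. The target attribute is GCC/Clang-specific
   and stands in for compiling the file with -mavx2. */
__attribute__((target("avx2")))
float sum_above(const float *data, size_t n, float threshold)
{
    const __m256 thresh = _mm256_set1_ps(threshold);
    __m256 acc = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(data + i);
        /* All-ones in elements where x > threshold, all-zero elsewhere. */
        __m256 mask = _mm256_cmp_ps(x, thresh, _CMP_GT_OQ);
        /* vandps turns rejected elements into +0.0 before the add, so the
           loop-carried dependency through acc is a single vaddps. */
        acc = _mm256_add_ps(acc, _mm256_and_ps(x, mask));
    }

    /* Horizontal sum of the eight partial sums, in scalar code for clarity. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int i = 0; i < 8; i++)
        total += lanes[i];
    return total;
}
```

Blending after the add would give the same sums, but it would put the blend into the dependency chain through `acc`.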


Conditional multiply is more cumbersome because 0.0 is not the multiplicative identity. You need to multiply by 1.0 to keep a value unchanged, and you can't easily produce that with an AND or ANDN with a compare result. You can blendv an input, or you can do the multiply and blendv the output.
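A minimal sketch of the "blendv an input" variant, assuming a made-up task of scaling only the negative elements. `scale_negatives` is a hypothetical name, and the `target` attribute is a GCC/Clang extension used in place of `-mavx2`.

```c
#include <immintrin.h>

/* Hypothetical helper: multiplies the negative elements of src[0..7] by
   scale and copies the others unchanged into dst[0..7]. */
__attribute__((target("avx2")))
void scale_negatives(float *dst, const float *src, float scale)
{
    __m256 x    = _mm256_loadu_ps(src);
    /* All-ones where x < 0, all-zero elsewhere. */
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_LT_OQ);

    /* vblendvps takes the second source where the mask sign bit is set:
       the factor is scale for selected elements and 1.0 for the rest. */
    __m256 factor = _mm256_blendv_ps(_mm256_set1_ps(1.0f),
                                     _mm256_set1_ps(scale), mask);

    _mm256_storeu_ps(dst, _mm256_mul_ps(x, factor));
}
```

The other option is to multiply unconditionally and then blend the product with the original `x`, which moves the blend after the multiply.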

The alternative to blendv is at least 3 boolean operations, like AND/ANDN/OR, but that's usually not worth it. Note, though, that Haswell runs vblendvps and vpblendvb as 2 uops for port 5, so blendv is a potential bottleneck compared to integer booleans that can run on any port. Skylake runs them as 2 uops for any port. It could still make sense to do something to avoid having a blendv on the critical path.
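For reference, the AND/ANDN/OR select can be sketched like this, using an element-wise signed maximum purely as an illustration (AVX2 has vpmaxsd for the real thing). `select_max_epi32` is a hypothetical name, and the `target` attribute is a GCC/Clang extension used in place of `-mavx2`.

```c
#include <immintrin.h>

/* Hypothetical helper: dst[i] = a[i] > b[i] ? a[i] : b[i] for i in 0..7,
   built from a compare and three integer boolean operations. */
__attribute__((target("avx2")))
void select_max_epi32(int *dst, const int *a, const int *b)
{
    __m256i va   = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb   = _mm256_loadu_si256((const __m256i *)b);
    /* All-ones where a > b (signed), all-zero elsewhere. */
    __m256i mask = _mm256_cmpgt_epi32(va, vb);

    __m256i from_a = _mm256_and_si256(mask, va);     /* mask & a  */
    __m256i from_b = _mm256_andnot_si256(mask, vb);  /* ~mask & b */

    _mm256_storeu_si256((__m256i *)dst, _mm256_or_si256(from_a, from_b));
}
```

This only works because the compare result is all-ones or all-zero per element; blendv looks only at the top bit of each element, so the two are not interchangeable for arbitrary masks.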

Masking an input operand or blending the result is generally how you do branchless SIMD conditionals.

BLENDV is usually at least 2 uops, so it's slower than an AND.

Immediate blends are much more efficient, but you can't use them with a compare result, because the imm8 blend control has to be a compile-time constant embedded into the instruction's machine code. That's what immediate means in an assembly-language context.
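To illustrate what the immediate form looks like when the selection really is fixed at compile time, here is a small sketch. `blend_upper_half` is a hypothetical name, the control value 0xF0 is chosen arbitrarily, and the `target` attribute is a GCC/Clang extension used in place of `-mavx2`.

```c
#include <immintrin.h>

/* Hypothetical helper: dst gets elements 0..3 from a and 4..7 from b. */
__attribute__((target("avx2")))
void blend_upper_half(float *dst, const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    /* Bit i of the immediate selects b[i] when set, a[i] when clear.
       0xF0 is encoded into the vblendps instruction itself, so it must
       be a constant expression and cannot come from a runtime compare. */
    __m256 r = _mm256_blend_ps(va, vb, 0xF0);

    _mm256_storeu_ps(dst, r);
}
```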

Upvotes: 2

Aki Suihkonen

Reputation: 20037

The Intel Intrinsics Guide lists gather, load, and store variants that operate with a mask. The immediate imm8 in blend_epi16 is not programmable at run time unless self-modifying code or a jump table is considered an option. It's still possible to derive the value itself: movemask in AVX2 yields 32 independent mask bits, one per byte, whereas each imm8 bit of blend_epi16 controls four bytes, that is, one 16-bit element in each 128-bit lane. pext from BMI2 can compact the movemask result by keeping only the odd-positioned bits.
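As one example of the masked memory operations, a compare result can drive a masked store directly. This is only a sketch: `store_where_greater` is a hypothetical name, and the `target` attribute is a GCC/Clang extension used in place of `-mavx2`.

```c
#include <immintrin.h>

/* Hypothetical helper: for i in 0..7, writes src[i] to dst[i] only where
   src[i] > limit; the other elements of dst are left untouched. */
__attribute__((target("avx2")))
void store_where_greater(int *dst, const int *src, int limit)
{
    __m256i v    = _mm256_loadu_si256((const __m256i *)src);
    /* All-ones where src > limit (signed), all-zero elsewhere. */
    __m256i mask = _mm256_cmpgt_epi32(v, _mm256_set1_epi32(limit));

    /* vpmaskmovd stores only the elements whose mask sign bit is set. */
    _mm256_maskstore_epi32(dst, mask, v);
}
```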

Upvotes: 3
