zinga

Reputation: 789

Mixing SSE with AVX128 for shorter instructions?

From all the information I could gather, there's no performance penalty from mixing SSE and 128-bit (E)VEX-encoded instructions, which suggests it should be fine to mix the two. That could be beneficial, since SSE instructions are often 1 byte shorter than their VEX equivalents.

However, I've never seen anyone, or any compiler, do this. As an example, in Intel's AVX (128-bit) MD5 implementation, various vmovdqa instructions could be replaced with movaps (or the vshufps could be replaced with the shorter shufps, since the destination and first source register are the same).
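
To make the size difference concrete, here are hand-assembled encodings (my own byte counts, so worth double-checking with a disassembler):

    vmovdqa xmm1, xmm2             ; C5 F9 6F CA      4 bytes (VEX)
    movaps  xmm1, xmm2             ; 0F 28 CA         3 bytes (SSE1)
    vshufps xmm1, xmm1, xmm2, 0    ; C5 F0 C6 CA 00   5 bytes
    shufps  xmm1, xmm2, 0          ; 0F C6 CA 00      4 bytes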
Is there any particular reason for this avoidance of SSE, or is there something I'm missing?

Upvotes: 4

Views: 456

Answers (1)

Peter Cordes

Reputation: 365707

You're right: if the YMM uppers are known to be zero from a vzeroupper, mixing AVX128 and SSE has no penalty, and it's a missed optimization not to use the shorter SSE encodings when that would save code size.

Also note that it only saves code size if you don't need a REX prefix: a 2-byte VEX prefix is the same size as REX + 0F for an SSE1 instruction. Compilers do try to favour low registers to hopefully avoid REX prefixes, but I think they don't look at which combinations of registers are used in each instruction to minimize total REX prefixes. (Or if they do try, they're not good at it.) Humans can spend time planning like that.
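
e.g. (hand-assembled encodings; double-check with a disassembler):

    movaps  xmm1, xmm2    ; 0F 28 CA       3 bytes: low regs, no REX, SSE saves a byte
    vmovaps xmm1, xmm2    ; C5 F8 28 CA    4 bytes
    movaps  xmm8, xmm1    ; 44 0F 28 C1    4 bytes: REX.R needed
    vmovaps xmm8, xmm1    ; C5 78 28 C1    4 bytes: 2-byte VEX, same size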

It's pretty minor most of the time, just an occasional byte of code size. That's usually a good thing and can help the front-end. (Or it saves a uop to use SSE4 blendvps xmm, xmm, <XMM0> instead of AVX vblendvps xmm, xmm, xmm, xmm on Intel CPUs (same for the pd and pblendvb forms), if you can arrange to use it without needing another movaps to deal with the blend control having to be in XMM0. See https://uops.info/.)
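
For example (uop counts are the Skylake numbers I remember from uops.info; check your target CPU):

    blendvps  xmm1, xmm2              ; blend control implicit in XMM0: 1 uop
    vblendvps xmm1, xmm1, xmm2, xmm0  ; explicit 4th operand: 2 uops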

The downside if you get it wrong is an SSE/AVX transition penalty (on Haswell and Ice Lake), or a false dependency on Skylake. See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? IDK if Zen2 does anything like that; Zen1 splits 256-bit operations into 2 uops and doesn't care about vzeroupper.
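
A minimal sketch of the case that goes wrong:

    vaddps ymm0, ymm0, ymm1     ; dirties the upper half of ymm0
    movaps xmm2, xmm3           ; dirty uppers: transition penalty (HSW/ICL) or false dep (SKL)

    vaddps ymm0, ymm0, ymm1
    vzeroupper                  ; uppers now known zero
    movaps xmm2, xmm3           ; safe, and 1 byte shorter than vmovaps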


For compilers to do it safely, they would have to keep track of more stuff to make sure they don't run an SSE instruction inside a function while a YMM register has a dirty upper half. Compilers don't have an option to limit AVX code-gen to 128-bit instructions only, so they'd have to start tracking paths of execution that could have dirtied a YMM upper half.

However, I think they have to do that anyway on a whole-function basis to know when to use vzeroupper before ret (in functions that don't accept or return a __m256/i/d by value, which would mean the caller is already using wide vectors).
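
Roughly (a sketch with made-up function names, not actual compiler output):

    uses_ymm:                       ; touched 256-bit registers
        vaddps  ymm0, ymm0, ymm1
        vzeroupper                  ; uppers were dirtied; the caller may run legacy SSE
        ret

    xmm_only:                       ; never dirtied any uppers
        vmulps  xmm0, xmm0, xmm1
        ret                         ; no vzeroupper needed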

But not needing vzeroupper is a separate thing from whether movaps is performance-safe, so it would be one more thing to track in a similar way: finding every case where it's safe to avoid a VEX prefix.

Still, there are probably cases where it's easy to prove it would be safe. It would be fine if compilers used a conservative algorithm with some missed optimizations when branching might or might not have dirtied uppers, always using VEX (and always using vzeroupper) in those cases.

Upvotes: 7
