Reputation: 64955

In-lane, cross 64-bit element data movement in AVX2

Are there any register-to-register¹ AVX or AVX2 instructions which move data in any way between 64-bit halves of 128-bit lanes in ymm regs, that don't use port 5 on contemporary Intel²?

¹ Such a thing kind of exists for memory sources, in the form of the D and Q broadcast instructions.

² Haswell through Skylake-S (although if anything exists in AVX-512 as implemented in SKX it's worth mentioning).

Upvotes: 3

Answers (1)

Peter Cordes

Reputation: 364593

I don't think it's possible in a single reg-reg instruction, but store/reload can move data in-lane without port 5. Even funky stuff like dppd or vcvtps2pd needs a port 5 shuffle. All register-source shuffle instructions run on port 5 in Haswell and later (until Ice Lake, which adds a 2nd shuffle unit on another port that can do some shuffles).

Obviously a misaligned reload can do any byte-shift but that will cause a store-forwarding stall, and you'd have to mask off unwanted data.
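Here's a rough sketch of what the misaligned-reload idea computes, in C with intrinsics. `high_to_low_via_reload` and the 5-element scratch buffer are made up for illustration. The blend (not a port 5 shuffle on these CPUs) is one way to mask off the unwanted elements. An optimizing compiler is free to replace the store/reload with a register shuffle, so take this as showing the data movement only, not guaranteed code-gen. Without AVX enabled it falls back to a scalar model of the same thing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst = { src[1], 0, src[3], 0 }, i.e. the high qword
 * of each 128-bit lane lands in the low qword of that lane. */
static void high_to_low_via_reload(const double src[4], double dst[4])
{
    assert(src != NULL && dst != NULL);
    /* One extra element so the offset reload stays inside the buffer. */
    double buf[5] = {0.0, 0.0, 0.0, 0.0, 0.0};
#if defined(__AVX__)
    _mm256_storeu_pd(buf, _mm256_loadu_pd(src));   /* store the vector      */
    __m256d shifted = _mm256_loadu_pd(buf + 1);    /* reload 8 bytes later  */
    /* shifted = { src[1], src[2], src[3], 0 }: elements 1 and 3 are
     * unwanted, so blend zeros into them (imm 0xA selects elements 1, 3
     * from the second operand). */
    __m256d r = _mm256_blend_pd(shifted, _mm256_setzero_pd(), 0xA);
    _mm256_storeu_pd(dst, r);
#else
    /* Scalar model of the same store / offset reload / mask sequence. */
    memcpy(buf, src, 4 * sizeof(double));
    dst[0] = buf[1];
    dst[1] = 0.0;
    dst[2] = buf[3];
    dst[3] = 0.0;
#endif
}
```

The reload starts inside the stored data and runs past the end of it, which is the case that triggers the store-forwarding stall mentioned above.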

vmovddup x/y/zmm, [mem] runs purely on load ports, exactly like vbroadcastsd. It's an in-lane broadcast of the low qword. vmovsldup and vmovshdup also only need a load port, but don't meet your requirement of moving between 64-bit halves of a lane.
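To make the in-lane broadcast concrete, here's a minimal sketch using the `_mm256_movedup_pd` intrinsic on a value loaded from memory. The helper name `movddup_from_memory` is made up. Whether the compiler actually folds the load into a memory-source `vmovddup` is up to the compiler, so check the asm if it matters. Without AVX enabled the function falls back to a scalar model of the same result.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dst = { src[0], src[0], src[2], src[2] },
 * i.e. the low qword of each 128-bit lane is duplicated within its lane. */
static void movddup_from_memory(const double src[4], double dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__AVX__)
    /* A compiler may fold this load into vmovddup ymm, [mem]. */
    __m256d v = _mm256_movedup_pd(_mm256_loadu_pd(src));
    _mm256_storeu_pd(dst, v);
#else
    /* Scalar model; read both inputs first so src == dst also works. */
    double lo0 = src[0];
    double lo1 = src[2];
    dst[0] = lo0;
    dst[1] = lo0;
    dst[2] = lo1;
    dst[3] = lo1;
#endif
}
```

Note that the high qword of each lane (`src[1]`, `src[3]`) never shows up in the result, which is why this only goes low-to-high within a lane.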

There's no movhdup that duplicates the high half within each lane, only movddup that duplicates the low double-precision FP element (SSE3 for the xmm version, AVX1 for the ymm version).

As @harold points out, phminposuw can put data from the high 64 bits into the low 64 bits. But it's not available in a YMM version. It may be the only instruction that has a special-purpose execution unit that can do that outside of shuffles. psadbw works inside 64-bit elements. vdbpsadbw is 1 uop for p5 on SKX. mpsadbw is multi-uop including 2p5. phadd instructions are also 2p5.

Zen 2 has 0.5c throughput for vpshufd ymm (instlat). It's slower than Intel at handling lane-crossing shuffles with granularity less than 128-bit, but has good performance on in-lane shuffles and 128-bit shuffles like vperm2f128.

Upvotes: 4
