Reputation: 64955
Are there any register-to-register [1] AVX or AVX2 instructions which move data in any way between 64-bit halves of 128-bit lanes in ymm regs, that don't use port 5 on contemporary Intel [2]?
[1] Such a thing kind of exists for memory sources, in the form of the D and Q broadcast instructions.
[2] Haswell through Skylake-S (although if anything exists in AVX-512 as implemented in SKX, it's worth mentioning).
Upvotes: 3
Views: 497
Reputation: 364593
I don't think it's possible with a single reg-reg instruction, but store/reload can move data in-lane without port 5. Even funky stuff like dppd or vcvtps2pd needs a port 5 shuffle uop. All register-source shuffle instructions run on port 5 on Haswell and later (until Ice Lake, which adds a second shuffle unit on another port that can handle some shuffles).
Obviously a misaligned reload can do any byte-shift, but that will cause a store-forwarding stall, and you'd have to mask off unwanted data.
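To make the store/reload idea concrete, here is a rough scalar model in Python of the data movement only. It says nothing about ports or stalls. The helper names store_reload_shift and mask_qwords are made up for illustration, and the zero-filled scratch buffer is an assumption standing in for whatever bytes happen to follow the store in real memory.

```python
import struct

def store_reload_shift(qwords, byte_offset):
    """Model storing a 256-bit vector (4 qwords), then reloading 32 bytes at an offset.

    Hypothetical helper: it only models which bytes end up where.
    """
    # 64-byte scratch buffer. The bytes past the 32-byte store are zero here,
    # but in real code they are whatever was already in memory.
    buf = bytearray(64)
    struct.pack_into("<4Q", buf, 0, *qwords)
    return list(struct.unpack_from("<4Q", buf, byte_offset))

def mask_qwords(qwords, keep):
    """Zero every qword whose keep flag is False (stands in for an AND with a mask)."""
    return [q if k else 0 for q, k in zip(qwords, keep)]

# Reloading 8 bytes higher moves each lane's high qword into that lane's low qword,
# but the high qwords now hold data from the neighbouring lane or past the store.
shifted = store_reload_shift([0xA, 0xB, 0xC, 0xD], 8)
cleaned = mask_qwords(shifted, [True, False, True, False])
print([hex(q) for q in shifted])  # ['0xb', '0xc', '0xd', '0x0']
print([hex(q) for q in cleaned])  # ['0xb', '0x0', '0xd', '0x0']
```

The masking step is what "mask off unwanted data" refers to: after the shifted reload, only the low qword of each lane holds the element you wanted.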
vmovddup x/y/zmm, [mem] runs purely on load ports, exactly like vbroadcastsd. It's an in-lane broadcast of the low qword of each 128-bit lane. vmovsldup and vmovshdup also only need a load port, but don't meet your requirement of moving between 64-bit halves of a lane.
There's no movhdup that duplicates the high half within each lane, only movddup, which duplicates the low double-precision FP element. It's SSE3 for the xmm version and AVX1 for the ymm version.
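As a sketch of what these duplicate/broadcast instructions do to the elements, here is a scalar Python model. Vectors are plain lists with element 0 as the lowest element. The function names simply mirror the mnemonics and are not a real API.

```python
def vmovddup(qwords):
    """In-lane broadcast: each 128-bit lane's low qword is copied to its high qword."""
    return [qwords[i - (i % 2)] for i in range(len(qwords))]

def vbroadcastsd(qword, n=4):
    """Broadcast a single qword to all n elements (n=4 models a ymm destination)."""
    return [qword] * n

def vmovsldup(dwords):
    """Duplicate even-indexed dwords: stays inside each 64-bit half."""
    return [dwords[i - (i % 2)] for i in range(len(dwords))]

def vmovshdup(dwords):
    """Duplicate odd-indexed dwords: also stays inside each 64-bit half."""
    return [dwords[i | 1] for i in range(len(dwords))]

print(vmovddup([0, 1, 2, 3]))                   # [0, 0, 2, 2]
print(vbroadcastsd(7))                          # [7, 7, 7, 7]
print(vmovsldup([0, 1, 2, 3, 4, 5, 6, 7]))      # [0, 0, 2, 2, 4, 4, 6, 6]
print(vmovshdup([0, 1, 2, 3, 4, 5, 6, 7]))      # [1, 1, 3, 3, 5, 5, 7, 7]
```

Only vmovddup moves anything across the 64-bit boundary inside a lane, and only from low to high; nothing in this group copies the high qword down.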
As @harold points out, phminposuw can put data from the high 64 bits into the low 64 bits. But it's not available in a YMM version. It may be the only instruction that has a special-purpose execution unit that can do that outside of shuffles. psadbw works inside 64-bit elements. vdbpsadbw is 1 uop for p5 on SKX. mpsadbw is multi-uop including 2p5. phadd instructions are also 2p5.
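For reference, a scalar Python sketch of what phminposuw computes, showing how a word from the high 64 bits can end up in the low 64 bits. The function name mirrors the mnemonic; the list-of-words representation is just a convenience for illustration.

```python
def phminposuw(words):
    """Model of phminposuw on 8 unsigned 16-bit words (element 0 is the lowest).

    Result word 0 holds the minimum, word 1 holds its index (lowest index on
    ties), and the remaining words are zero.
    """
    if len(words) != 8:
        raise ValueError("phminposuw operates on exactly 8 words")
    minimum = min(words)
    index = words.index(minimum)  # first occurrence, i.e. the lowest index
    return [minimum, index, 0, 0, 0, 0, 0, 0]

# The minimum lives in word 4, which is in the high 64 bits of the source.
print(phminposuw([50, 60, 70, 80, 5, 90, 100, 110]))  # [5, 4, 0, 0, 0, 0, 0, 0]
```

This only moves the single smallest word, so it is a curiosity rather than a general way to move a 64-bit half.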
Zen 2 has 0.5c throughput for vpshufd ymm (instlat). It's slower than Intel at lane-crossing shuffles with granularity finer than 128 bits, but it performs well on in-lane shuffles and 128-bit shuffles like vperm2f128.
Upvotes: 4