namea hang

Reputation: 11

Which execution ports can SIMD shuffles use for AVX2 and NEON?

While reading Intel's Optimization Reference Manual, I noticed the section "HANDLING PORT 5 PRESSURE".

It basically says that port 5 in the Sandy Bridge microarchitecture includes shuffle units, which frequently become a performance bottleneck in code that does a lot of shuffling.

My question: does that mean this port 5 shuffle bottleneck only happens on Sandy Bridge? What about Alder Lake or other architectures? Why is the shuffle execution unit placed on only one port? Are there strict restrictions on this unit?

Moreover, NEON has shuffle intrinsics like vtbl1_s8, similar to AVX2's _mm256_shuffle_epi8. Is there any performance bottleneck caused by port restrictions for tbl on NEON?

I assume there is some port bottleneck for tbl on NEON, but I found no documentation about it.

Upvotes: 1

Views: 97

Answers (1)

Peter Cordes

Reputation: 365537

Since Ice Lake, Intel has had a second shuffle execution unit on port 1 that can handle some (but not all) shuffles, up to 256 bits wide. For example, it can handle shufps but not unpcklps or unpcklpd, even though unpcklpd can be expressed as a shufps with the right immediate. (unpcklps can't, unless both inputs are the same register.)
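To make the shufps / unpcklpd relationship concrete, here is a small scalar model in Python. It only models which element ends up where, not how any hardware executes it, and the function names are made up for this illustration. Vectors are lists of four 32-bit elements with element 0 as the lowest; a brute-force search over all 256 immediates finds which ones (if any) reproduce each unpack.

```python
def shufps(a, b, imm8):
    """Scalar model of 128-bit shufps: low two elements come from a, high two from b."""
    return [a[imm8 & 3], a[(imm8 >> 2) & 3], b[(imm8 >> 4) & 3], b[(imm8 >> 6) & 3]]


def unpcklpd(a, b):
    """Low 64 bits of a, then low 64 bits of b, written as 32-bit elements."""
    return [a[0], a[1], b[0], b[1]]


def unpcklps(a, b):
    """Interleave the low two 32-bit elements of a and b."""
    return [a[0], b[0], a[1], b[1]]


def matching_immediates(target, a, b):
    """All shufps immediates (hypothetical helper) that produce `target` from a, b."""
    return [imm for imm in range(256) if shufps(a, b, imm) == target]


# Distinct element values so every match is unambiguous.
a = [0, 1, 2, 3]
b = [4, 5, 6, 7]

UNPCKLPD_IMM = matching_immediates(unpcklpd(a, b), a, b)       # one immediate works
UNPCKLPS_IMM = matching_immediates(unpcklps(a, b), a, b)       # none works
UNPCKLPS_SAME_IMM = matching_immediates(unpcklps(a, a), a, a)  # works if both inputs match
```

In this model the only immediate that reproduces unpcklpd is 0x44, no immediate reproduces unpcklps with two different inputs (element 1 of the result would have to come from the second source), and 0x50 works when both sources are the same register.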

AMD Zen-family has shuffle units on multiple ports. Zen 4 and later typically have equal or better shuffle throughput for an instruction than Intel P-cores. (Zen 1 decoded 256-bit instructions to at least 2 uops, with lane-crossing shuffles being significantly worse. Zen 2 and 3 for some reason still decode vpermq ymm, ymm, i8 and vpermd y,y,y to 2 uops, but are fast for vperm2i128 which was 8 uops on Zen 1.)

See https://uops.info/ for the full details on which ports are needed by which instruction on modern x86 uarches, derived from automated micro-benchmarking with performance counters (which you can see by clicking on numbers in the tables; if something seems wrong you can see exactly what instruction-sequence ran at what speed).
https://agner.org/optimize/ has a microarchitecture guide which describes the pipeline, and instruction tables (edited by hand so there are occasional typos, and with less exhaustive microbenchmarking so he doesn't show latency differences from different inputs to different outputs).


ARM CPUs:

Again, it varies by microarchitecture. Look at the optimization guide for a few specific cores you're interested in, like Cortex-A76. I don't know of any resources that aggregate that info the way uops.info does for x86.

Some earlier (and lower-power) ARM Cortex designs have lowish throughput for tbl, but Cortex-A76 has 2/clock throughput (with 2-cycle latency) for the 1- and 2-register forms, running on the V pipes. (1 per 2 cycle throughput for the 3-register form, 2/3 cycle throughput for the 4-register form.) From this we can conclude that both V (vector) pipes have a shuffle unit that can handle tbl with 1 or 2 registers.
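As a rough illustration of how such throughput numbers turn into a bottleneck estimate, the sketch below computes a lower bound on cycles for a run of independent tbl instructions. It uses only the Cortex-A76 figures quoted above for the 1-, 2- and 3-register forms, and it is an assumption-laden simplification: it ignores latency, the front end, and any other instructions competing for the V pipes. The names are hypothetical.

```python
from fractions import Fraction

# Instructions per cycle for tbl on Cortex-A76, keyed by number of table
# registers. Values are the ones quoted in the text above, nothing more.
TBL_PER_CYCLE = {1: Fraction(2), 2: Fraction(2), 3: Fraction(1, 2)}


def min_cycles(tbl_count, table_regs):
    """Throughput-only lower bound on cycles for tbl_count independent tbl uses."""
    return Fraction(tbl_count) / TBL_PER_CYCLE[table_regs]
```

For example, `min_cycles(8, 1)` gives 4 cycles while `min_cycles(8, 3)` gives 16, so in shuffle-bound code the 3-register form is the one to watch.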

If you care about older CPUs where tbl might be slow, try to use shuffles with fixed behaviour like trn. It's much easier (fewer transistors) to build an execution unit with a few fixed routings for data than one where each byte can select from any of 16 or 32 sources, which needs muxes controlled by the bits of the index data from another input register. This is the same reason Intel puts SIMD add/bitwise execution units on every port but not a shuffle unit.
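The difference between a fixed shuffle and a table lookup can be sketched with a scalar model in Python. This models only the data movement of AArch64 tbl and trn1 on byte elements, with hypothetical function names; it says nothing about how a real execution unit is built. The point is that trn1's routing is fully determined by the instruction, while tbl takes its routing from a second data input, so tbl can emulate trn1 given the right constant indices.

```python
def tbl(table, idx):
    """Scalar model of tbl: each output byte picks any table byte, 0 if out of range.

    `table` is the concatenation of 1 to 4 16-byte table registers.
    """
    return [table[i] if i < len(table) else 0 for i in idx]


def trn1(a, b):
    """Scalar model of trn1: fixed routing, even elements of a and b interleaved."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i], b[i]]
    return out


a = list(range(10, 26))    # 16 bytes: 10..25
b = list(range(100, 116))  # 16 bytes: 100..115

# Index vector that makes a 2-register tbl (table = a followed by b) act like trn1:
# even output lanes read a[k], odd output lanes read b[k], which sits at 16 + k.
TRN1_INDICES = []
for k in range(0, 16, 2):
    TRN1_INDICES += [k, 16 + k]

TRN1_VIA_TBL = tbl(a + b, TRN1_INDICES)
```

Here `TRN1_VIA_TBL` equals `trn1(a, b)`, but the tbl version needs an extra register holding the indices and the general any-to-any selection, whereas trn1 needs neither.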

Cortex-A76 has 8 total pipes (execution ports): branch, 3 integer (including one that can handle multi-cycle instructions), 2 FP/ASIMD, 2 load/store.

https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important takes a look at the old Cortex-A53, which was used for the efficiency cores on a lot of big.LITTLE designs, and on its own in low-end CPUs.
https://chipsandcheese.com/p/arms-cortex-a710-winning-by-default has a block diagram of Cortex-A710

Upvotes: 1
