Reputation: 1105
Changing add to adc in the highlighted line below significantly improves performance. I find this very counter-intuitive, since add can execute on more ports and does not depend on flags.
CPU: Intel i7-9750H (Coffee Lake).
UOPS_ISSUED.ANY for add = ~2.87 uops/cycle.
UOPS_ISSUED.ANY for adc = ~3.47 uops/cycle.
Retire slots are 98.5% of uops issued in both cases.
This is reflected in the benchmark times: the add version is a lot slower. I would appreciate it if someone could help me understand why add is slower. I can provide more metrics; I just don't know what to look for.
# Code to multiply large integer by a qword.
# RSI = input large integer (qword array).
# RDI = output large integer (qword array).
# RDX = qword to multiply the large integer by.
# ECX = number of 32-byte blocks to process (i.e. qwords count / 4).
# RAX = carry
xor eax, eax
.p2align 4
L_mul_word_loop:
mulx r9, r8, [rsi]
add r8, rax ### <-- "adc r8, rax" (+20-30% performance)
mov [rdi], r8
mulx rax, r8, [rsi+8] # rax:r8 = RDX * [rsi+8]
adc r9, r8 # add high-half of last mulx to low half of this
mov [rdi+8], r9 # and store.
mulx r9, r8, [rsi+16]
adc r8, rax
mov [rdi+16], r8
mulx rax, r8, [rsi+24]
adc r9, r8
mov [rdi+24], r9
adc rax, 0 # include CF into high half of last mulx: can't wrap
add rdi, 32
add rsi, 32
dec ecx
jnz L_mul_word_loop
The loop-carried dependency chain is through CF between the add -> adc instructions, and through RAX from the bottom of the loop back to the top. (The largest possible product has a high half less than 0xFF..., so the final adc rax, 0 can't wrap and produce its own carry-out; e.g. 0xFF * 0xFF = 0xFE01.) This dependency chain is 5 cycles long (five back-to-back 1-cycle add/adc uops).
(Experimentally, per the comments, using lea instead of add for the pointer increments and removing the adc rax, 0 to shorten the chain to 4 cycles isn't faster than this version using adc at the top of the loop.)
Minimal reproducible code: add-adc.asm
Full code is on GitHub (only tested on Mac; an older version was working on Ubuntu, although I haven't checked in a while). I'm observing this effect with this test:
build/benchmark_asm_x86_64 --benchmark_filter=mul_word/100 --benchmark_repetitions=3 --benchmark_report_aggregates_only=true
Update
I added a link to a minimal reproducible example (a single ASM file) and tried to gather per-iteration metrics. Note that the outer-loop iteration count is much higher than the inner-loop count (I want to keep the arrays small enough for the L1 cache); hopefully that does not skew the numbers too much. Here's what I got:
add version, per inner loop iteration:
~ 6.88 cycles
~ 20.00 uops_issued.any
~ 4.83 uops_dispatched_port.port1
adc version, per inner loop iteration:
~ 6.12 cycles
~ 20.11 uops_issued.any
~ 4.34 uops_dispatched_port.port1
Upvotes: 12
Views: 428
Reputation: 2409
I think this is caused by suboptimal scheduling, taking an extra hit from writeback conflict penalties.
Comparing uops.info measurements for 64-bit mulx against variants of imul, we can reasonably conclude that mulx comprises one port 1 uop with latency 3 that produces the low 64 bits of the result, plus one port 5 uop that produces the high 64 bits one cycle later.
Consider what happens when a 3-cycle uop is dispatched to port 1 on cycle i while a 1-cycle uop is pending for the same port. The 1-cycle uop cannot be dispatched on cycle i (the port cannot accept a second uop in that cycle). It also cannot be dispatched on cycle i+2: that would cause a writeback conflict on that port on cycle i+3 (the port cannot write back results for two uops in the same cycle).
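To make the constraint concrete, here is a hypothetical dispatch timeline for port 1 (the cycle numbers are purely illustrative):

# cycle i   : 3-cycle mulx low-half uop dispatches on port 1; it will write back on cycle i+3
# cycle i+1 : a 1-cycle add may dispatch; it writes back on cycle i+2, so no conflict
# cycle i+2 : a 1-cycle add must not dispatch; it would write back on cycle i+3, colliding with the mulx
# cycle i+3 : the 1-cycle add can dispatch again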
By constructing a loop with forced writeback conflicts, Fabian Giesen demonstrated there's apparently an extra penalty when that happens.
Hence, if you have a mix of single-cycle and multi-cycle uops competing for the same port, it's a bit of a landmine. My initial attempt at a fix was to add a fifth mulx instruction to the loop (whose outputs would simply be discarded), so that there is constant pressure of multi-cycle uops on p1 and p5 and the add instructions get scheduled elsewhere. That worked to some degree, but it's possible to do even better!
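For concreteness, the extra instruction was something along these lines (the destination registers and placement are my choice for illustration; the results are never used):

mulx r11, r10, [rsi]       # dummy multiply: r10/r11 are discarded; keeps a steady supply of
                           # multi-cycle uops for p1/p5, and mulx leaves the flags untouched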
The following measurements are from Skylake (i7-6700).
The baseline version runs at ~6.9 cycles per iteration:
10,337,227,178 cycles:u
4,283,434,673 uops_dispatched_port.port_0:u
7,278,477,707 uops_dispatched_port.port_1:u
6,849,168,521 uops_dispatched_port.port_5:u
5,609,252,055 uops_dispatched_port.port_6:u
10,384,026,044 cycles:u
3,922,820,702 uops_dispatched_port.port_2:u
4,024,756,546 uops_dispatched_port.port_3:u
6,388,517,494 uops_dispatched_port.port_4:u
4,087,151,242 uops_dispatched_port.port_7:u
Note that the number of store-data uops on port 4 is more than 5% higher than it should be. Some stores are being replayed, and this will have an increasingly confounding effect on the following measurements.
The variant with adc in place of the initial add runs at 6 cycles per iteration:
8,877,983,794 cycles:u
5,181,755,017 uops_dispatched_port.port_0:u
6,525,640,349 uops_dispatched_port.port_1:u
6,019,129,311 uops_dispatched_port.port_5:u
6,295,528,774 uops_dispatched_port.port_6:u
9,040,426,883 cycles:u
3,762,317,354 uops_dispatched_port.port_2:u
3,814,931,097 uops_dispatched_port.port_3:u
7,292,924,631 uops_dispatched_port.port_4:u
4,462,674,038 uops_dispatched_port.port_7:u
(note the even higher count on port 4)
Now let's swap the increments of rsi and rdi (rsi is needed earlier). This brings a very noticeable improvement, to 5.6 cycles per iteration:
8,376,301,855 cycles:u
5,129,834,339 uops_dispatched_port.port_0:u
6,632,790,174 uops_dispatched_port.port_1:u
6,088,383,045 uops_dispatched_port.port_5:u
6,173,097,806 uops_dispatched_port.port_6:u
8,404,972,940 cycles:u
4,287,284,508 uops_dispatched_port.port_2:u
4,317,891,165 uops_dispatched_port.port_3:u
7,408,432,079 uops_dispatched_port.port_4:u
3,429,913,047 uops_dispatched_port.port_7:u
(the port 4 count goes higher yet again)
Now, if we think that single-cycle uops dispatching to ports 1 and 5 are causing grief to the scheduler, can we figure out a way to force them onto other ports somehow? Yes! A macro-fused, predicted not-taken add-jcc pair can issue only to ports 0 and 6. So let's add jz .+2 after add rsi, 32 (note: the Skylake JCC erratum could bite me here, but luckily our loop starts at 16 bytes mod 32, so we avoid the problematic crossing).
8,339,935,074 cycles:u
4,749,784,934 uops_dispatched_port.port_0:u
6,429,206,233 uops_dispatched_port.port_1:u
6,045,479,355 uops_dispatched_port.port_5:u
6,798,134,405 uops_dispatched_port.port_6:u
8,330,518,755 cycles:u
4,250,791,971 uops_dispatched_port.port_2:u
4,284,637,776 uops_dispatched_port.port_3:u
7,379,587,531 uops_dispatched_port.port_4:u
3,503,715,123 uops_dispatched_port.port_7:u
Oh no, our cycle count barely budged! What if we finally bite the bullet and nop out the first store (nop dword ptr [rdi] in place of mov [rdi], r8):
7,912,840,444 cycles:u
4,648,297,975 uops_dispatched_port.port_0:u
6,334,248,230 uops_dispatched_port.port_1:u
6,059,133,113 uops_dispatched_port.port_5:u
6,982,987,722 uops_dispatched_port.port_6:u
7,843,194,695 cycles:u
3,810,009,654 uops_dispatched_port.port_2:u
3,790,523,332 uops_dispatched_port.port_3:u
6,094,690,751 uops_dispatched_port.port_4:u
2,938,959,057 uops_dispatched_port.port_7:u
That's 5.25 cycles per iteration. Applying the same jz .+2 trick to add rdi, 32, we get:
7,630,926,523 cycles:u
5,831,880,767 uops_dispatched_port.port_0:u
6,004,354,504 uops_dispatched_port.port_1:u
6,002,011,214 uops_dispatched_port.port_5:u
6,183,831,282 uops_dispatched_port.port_6:u
7,638,112,528 cycles:u
3,751,406,178 uops_dispatched_port.port_2:u
3,695,775,382 uops_dispatched_port.port_3:u
5,638,534,896 uops_dispatched_port.port_4:u
3,091,934,468 uops_dispatched_port.port_7:u
And that's below 5.1 cycles per iteration, where the predicted throughput is 5 cycles per iteration. We can see that ports 1 and 5 are now occupied only by the mulx instructions. Restoring the nopped-out store, I get 5.36 cycles per iteration:
8,041,908,278 cycles:u
5,599,216,700 uops_dispatched_port.port_0:u
6,010,109,267 uops_dispatched_port.port_1:u
6,002,448,325 uops_dispatched_port.port_5:u
6,417,804,937 uops_dispatched_port.port_6:u
8,050,871,976 cycles:u
4,183,403,858 uops_dispatched_port.port_2:u
4,166,022,265 uops_dispatched_port.port_3:u
7,179,871,612 uops_dispatched_port.port_4:u
3,695,465,024 uops_dispatched_port.port_7:u
It's unclear to me what's causing the replays.
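Putting the pieces together, the final 5.36-cycles-per-iteration variant looks roughly like this (a sketch reconstructed from the changes described above; the exact code may differ):

xor eax, eax
.p2align 4
L_mul_word_loop:
mulx r9, r8, [rsi]
adc r8, rax                # the adc form from the question (faster than add here)
mov [rdi], r8              # first store restored
mulx rax, r8, [rsi+8]
adc r9, r8
mov [rdi+8], r9
mulx r9, r8, [rsi+16]
adc r8, rax
mov [rdi+16], r8
mulx rax, r8, [rsi+24]
adc r9, r8
mov [rdi+24], r9
adc rax, 0
add rsi, 32                # rsi incremented before rdi (rsi is needed earlier)
jz .+2                     # macro-fused, predicted not-taken: dispatches to port 0 or 6
add rdi, 32
jz .+2                     # same trick for the rdi increment
dec ecx
jnz L_mul_word_loop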
Upvotes: 2