Reputation: 1041
Consider the following instruction sequence using Haswell's FMA instructions:
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_fmadd_ps (rp1, m6, r1);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_fmadd_ps (rp1, m3, r2);
r2 = _mm256_fmadd_ps (rp2, m4, r2);
r2 = _mm256_fmadd_ps (rp3, m5, r2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_fmadd_ps (rp1, m0, r3);
r3 = _mm256_fmadd_ps (rp2, m1, r3);
r3 = _mm256_fmadd_ps (rp3, m2, r3);
The same computation can be expressed using non-FMA instructions as follows:
__m256 i1 = _mm256_mul_ps (rp1, m6);
__m256 i2 = _mm256_mul_ps (rp2, m7);
__m256 i3 = _mm256_mul_ps (rp3, m8);
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_add_ps (i1, i2);
r1 = _mm256_add_ps (r1, i3);
i1 = _mm256_mul_ps (rp1, m3);
i2 = _mm256_mul_ps (rp2, m4);
i3 = _mm256_mul_ps (rp3, m5);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_add_ps (i1, i2);
r2 = _mm256_add_ps (r2, i3);
i1 = _mm256_mul_ps (rp1, m0);
i2 = _mm256_mul_ps (rp2, m1);
i3 = _mm256_mul_ps (rp3, m2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_add_ps (i1, i2);
r3 = _mm256_add_ps (r3, i3);
One would expect the FMA version to provide some performance advantage over the non-FMA version.
But unfortunately, in this case, there is zero (0) performance improvement.
Can anyone help me understand why?
I measured both approaches on a core i7-4790 based machine.
UPDATE:
So I analyzed the generated machine code and determined that the MSFT VS2013 C++ compiler arranged it so that the dependency chains of r1 and r2 could dispatch in parallel, since Haswell has 2 FMA pipes.
r3 must dispatch after r1, so in this case the second FMA pipe sits idle.
I thought that if I unrolled the loop to do 6 sets of FMAs instead of 3, I could keep all the FMA pipes busy on every iteration.
Unfortunately, when I checked the assembly dump in this case, the MSFT compiler did not choose register assignments that would have allowed the kind of parallel dispatch I was looking for, and I verified that I did not get the performance increase I was hoping for.
Is there a way I can change my C code (using intrinsics) to enable the compiler to generate better code?
Upvotes: 4
Views: 1766
Reputation: 363980
re: your edit: Your code has three dependency chains (r1, r2, and r3), so it can keep three FMAs in flight at once. FMA on Haswell is 5c latency, one per 0.5c throughput, so the machine can sustain 10 FMAs in flight.
If your code is in a loop, and the inputs to one iteration aren't generated by the previous iteration, then you could be getting 10 FMAs in flight that way. (i.e. no loop-carried dependency chain involving the FMAs). But since you don't see a perf gain, there's probably a dep chain causing throughput to be limited by latency.
You didn't post the ASM you're getting from MSVC, but you claim something about register assignments. xorps same,same is a recognized zeroing idiom that starts a new dependency chain, just like using a register as a write-only operand (e.g. the destination of a non-FMA AVX instruction).
It's highly unlikely that the code could be correct but still contain a dependency of r3 on r1. Make sure you understand that out-of-order execution with register renaming allows separate dependency chains to use the same register.
BTW, instead of __m256 r1 = _mm256_xor_ps (r1, r1); you should use __m256 r1 = _mm256_setzero_ps();. You should avoid using the variable you're declaring in its own initializer! Compilers sometimes make silly code when you use uninitialized vectors, e.g. loading garbage from stack memory, or doing an extra xorps.
Even better would be:
__m256 r1 = _mm256_mul_ps (rp1, m6);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
This avoids needing an xorps to zero a register for the accumulator.
On Broadwell, mulps has lower latency than FMA.
On Skylake, FMA/mul/add are all 4c latency, one per 0.5c throughput. They dropped the separate adder from port1 and do it on the FMA unit. They shaved a cycle of latency off the FMA unit.
Upvotes: 1
Reputation: 64885
You didn't provide a full code sample that includes the surrounding loop (presumably there is one), so it is hard to answer definitively, but the main problem I see is that the latency of the dependency chains in your FMA code is considerably longer than in your multiply + addition code.
Each of the three blocks in your FMA code performs the same kind of independent operation:
TOTAL += A1 * B1;
TOTAL += A2 * B2;
TOTAL += A3 * B3;
As it is structured, each operation depends on the previous one, since each reads and writes TOTAL. So the latency of this chain of operations is 3 ops x 5 cycles/FMA = 15 cycles.
In your re-written version without FMA, the dependency chain on TOTAL is now broken, since you've done:
TOTAL_1 = A1 * B1; # 1
TOTAL_2 = A2 * B2; # 2
TOTAL_3 = A3 * B3; # 3
TOTAL_1_2 = TOTAL_1 + TOTAL_2; # 5, depends on 1,2
TOTAL = TOTAL_1_2 + TOTAL_3; # 6, depends on 3,5
The first three MUL instructions can execute independently since they don't have any dependencies. The two add instructions are serially dependent on the multiplications. The latency of this sequence is thus 5 + 3 + 3 = 11.
So the latency of the second method is lower, even though it uses more CPU resources (5 total instructions issued). It is certainly possible, then, depending on how the overall loop is structured, that the lower latency cancels out the throughput advantages of FMA for this code, if it is at least partly latency bound.
For a more comprehensive static analysis, I highly recommend Intel's IACA - which can take a loop iteration like the above, and tell you exactly what the bottleneck is, at least in the best case scenario. It can identify the critical paths in the loop, whether you are latency bound, etc.
Another possibility is that you are memory bound (latency or throughput), in which case you'll also see similar behavior for FMA vs MUL + ADD.
Upvotes: 7