Reputation: 1041
Consider the following instruction sequence using Haswell's FMA instructions:
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_fmadd_ps (rp1, m6, r1);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_fmadd_ps (rp1, m3, r2);
r2 = _mm256_fmadd_ps (rp2, m4, r2);
r2 = _mm256_fmadd_ps (rp3, m5, r2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_fmadd_ps (rp1, m0, r3);
r3 = _mm256_fmadd_ps (rp2, m1, r3);
r3 = _mm256_fmadd_ps (rp3, m2, r3);
The same computation can be expressed using non-FMA instructions as follows:
__m256 i1 = _mm256_mul_ps (rp1, m6);
__m256 i2 = _mm256_mul_ps (rp2, m7);
__m256 i3 = _mm256_mul_ps (rp3, m8);
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_add_ps (i1, i2);
r1 = _mm256_add_ps (r1, i3);
i1 = _mm256_mul_ps (rp1, m3);
i2 = _mm256_mul_ps (rp2, m4);
i3 = _mm256_mul_ps (rp3, m5);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_add_ps (i1, i2);
r2 = _mm256_add_ps (r2, i3);
i1 = _mm256_mul_ps (rp1, m0);
i2 = _mm256_mul_ps (rp2, m1);
i3 = _mm256_mul_ps (rp3, m2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_add_ps (i1, i2);
r3 = _mm256_add_ps (r3, i3);
One would expect the FMA version to provide some performance advantage over the non-FMA version.
But unfortunately, in this case, there is zero (0) performance improvement.
Can anyone help me understand why?
I measured both approaches on a core i7-4790 based machine.
UPDATE:
So I analyzed the generated machine code and determined that the MSFT VS2013 C++ compiler arranged it so that the dependency chains of r1 and r2 could dispatch in parallel, since Haswell has 2 FMA pipes.
r3 must dispatch after r1, so in this case the second FMA pipe sits idle.
I thought that if I unrolled the loop to do 6 sets of FMAs instead of 3, I could keep all the FMA pipes busy on every iteration.
Unfortunately, when I checked the assembly dump in this case, the MSFT compiler did not choose register assignments that would have allowed the kind of parallel dispatch I was looking for, and I verified that I did not get the performance increase I was hoping for.
Is there a way I can change my C code (using intrinsics) to enable the compiler to generate better code?
Upvotes: 4
Views: 1766
Reputation: 363980
re: your edit: Your code has three dependency chains (r1, r2, and r3), so it can keep three FMAs in flight at once. FMA on Haswell is 5c latency, one per 0.5c throughput, so the machine can sustain 10 FMAs in flight.
If your code is in a loop, and the inputs to one iteration aren't generated by the previous iteration, then you could be getting 10 FMAs in flight that way. (i.e. no loop-carried dependency chain involving the FMAs). But since you don't see a perf gain, there's probably a dep chain causing throughput to be limited by latency.
You didn't post the ASM you're getting from MSVC, but you claim something about register assignments. xorps same,same is a recognized zeroing idiom that starts a new dependency chain, just like using a register as a write-only operand (e.g. the destination of a non-FMA AVX instruction).
It's highly unlikely that the code could be correct but still contain a dependency of r3 on r1. Make sure you understand that out-of-order execution with register renaming allows separate dependency chains to use the same register.
BTW, instead of __m256 r1 = _mm256_xor_ps (r1, r1); you should use __m256 r1 = _mm256_setzero_ps();. You should avoid using the variable you're declaring in its own initializer! Compilers sometimes make silly code when you use uninitialized vectors, e.g. loading garbage from stack memory, or doing an extra xorps.
Even better would be:
__m256 r1 = _mm256_mul_ps (rp1, m6);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
This avoids needing an xorps to zero a register for the accumulator.
On Broadwell, mulps has lower latency than FMA.
On Skylake, FMA/mul/add are all 4c latency, one per 0.5c throughput. They dropped the separate adder from port1 and do it on the FMA unit. They shaved a cycle of latency off the FMA unit.
Upvotes: 1
Reputation: 64885
You didn't provide a full code sample that includes the surrounding loop (presumably there is one), so it is hard to answer definitively, but the main problem I see is that the latency of the dependency chains in your FMA code is considerably longer than in your multiply + addition code.
Each of the three blocks in your FMA code performs the same kind of independent operation:
TOTAL += A1 * B1;
TOTAL += A2 * B2;
TOTAL += A3 * B3;
As it is structured, each operation depends on the previous one, since each reads and writes TOTAL. So the latency of this chain of operations is 3 ops x 5 cycles/FMA = 15 cycles.
In your re-written version without FMA, the dependency chain on TOTAL is now broken, since you've done:
TOTAL_1 = A1 * B1; # 1
TOTAL_2 = A2 * B2; # 2
TOTAL_3 = A3 * B3; # 3
TOTAL_1_2 = TOTAL_1 + TOTAL_2; # 5, depends on 1,2
TOTAL = TOTAL_1_2 + TOTAL_3; # 6, depends on 3,5
The first three MUL instructions can execute independently since they don't have any dependencies. The two add instructions are serially dependent on the multiplications. The latency of this sequence is thus 5 + 3 + 3 = 11.
So the latency of the second method is lower, even though it uses more CPU resources (5 total instructions issued). It is certainly possible, then, depending on how the overall loop is structured, that the lower latency cancels out the throughput advantages of FMA for this code, if it is at least partly latency bound.
For a more comprehensive static analysis, I highly recommend Intel's IACA - which can take a loop iteration like the above, and tell you exactly what the bottleneck is, at least in the best case scenario. It can identify the critical paths in the loop, whether you are latency bound, etc.
Another possibility is that you are memory bound (latency or throughput), in which case you'll also see similar behavior for FMA vs MUL + ADD.
Upvotes: 7