Reputation: 140
For an assignment, I am trying to calculate the theoretical maximum achievable GFLOPS of a single core of my processor, an AMD Ryzen 9 5900HS. According to Agner Fog's instruction tables for Zen 3 (the R9 5900HS is a Zen 3 processor), the reciprocal throughput of a vfmadd132pd
instruction is 0.5, giving a maximum of 2 FMA3 instructions per clock. Since each AVX2 vector holds 4 doubles, that gives a theoretical maximum of 8 FLOPS/clock. With my core's average frequency of 3.16 GHz, that works out to a theoretical maximum of roughly 25 GFLOPS. All good so far.
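To make the arithmetic explicit, here is the calculation I am doing, written out as a small standalone C++ snippet; the constants are just my assumed values from above, not anything measured by the benchmark:

#include <cstdio>

int main() {
    // Assumed per-core numbers (from Agner Fog's tables and my observed clock):
    const double fma_per_clock      = 2.0;   // reciprocal throughput 0.5 -> 2 FMA3/clock
    const double doubles_per_vector = 4.0;   // 256-bit AVX2 vector of doubles
    const double freq_ghz           = 3.16;  // average core frequency I observed

    // My current reasoning: FLOPS/clock = FMAs/clock * vector lanes
    const double flops_per_clock = fma_per_clock * doubles_per_vector; // = 8
    const double peak_gflops     = flops_per_clock * freq_ghz;         // ~= 25.3

    printf("FLOPS/clock = %.0f, peak = %.2f GFLOPS\n", flops_per_clock, peak_gflops);
    return 0;
}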
Now, the issue arises when I run the Flops benchmark. Running the recommended 2017-Zen binary (both the precompiled binary present in the repo and one compiled locally) gives me the following results:
Running Zen tuned binary with 1 thread...
Single-Precision - 128-bit AVX - Add/Sub
GFlops = 22.528
Result = 2.86329e+06
Double-Precision - 128-bit AVX - Add/Sub
GFlops = 11.296
Result = 1.43905e+06
Single-Precision - 128-bit AVX - Multiply
GFlops = 25.536
Result = 3.26002e+06
Double-Precision - 128-bit AVX - Multiply
GFlops = 12.768
Result = 1.63224e+06
Single-Precision - 128-bit AVX - Multiply + Add
GFlops = 33.936
Result = 3.62228e+06
Double-Precision - 128-bit AVX - Multiply + Add
GFlops = 16.992
Result = 1.80521e+06
Single-Precision - 128-bit FMA3 - Fused Multiply Add
GFlops = 48.48
Result = 3.09552e+06
Double-Precision - 128-bit FMA3 - Fused Multiply Add
GFlops = 24.24
Result = 1.53887e+06
Single-Precision - 256-bit AVX - Add/Sub
GFlops = 45.12
Result = 5.74426e+06
Double-Precision - 256-bit AVX - Add/Sub
GFlops = 22.592
Result = 2.88426e+06
Single-Precision - 256-bit AVX - Multiply
GFlops = 50.976
Result = 6.46382e+06
Double-Precision - 256-bit AVX - Multiply
GFlops = 25.488
Result = 3.2681e+06
Single-Precision - 256-bit AVX - Multiply + Add
GFlops = 67.872
Result = 7.20537e+06
Double-Precision - 256-bit AVX - Multiply + Add
GFlops = 32.88
Result = 3.48666e+06
Single-Precision - 256-bit FMA3 - Fused Multiply Add
GFlops = 96.768
Result = 6.14762e+06
Double-Precision - 256-bit FMA3 - Fused Multiply Add
GFlops = 48.48
Result = 3.07488e+06
As you can see, the GFlops reported for the Double-Precision 256-bit FMA3 benchmark is 48.48, almost double the theoretical maximum I calculated for my core. (Note that I monitored the CPU frequency in Task Manager, and it stayed at 3.16 GHz, occasionally ticking up to 3.17 GHz, throughout the benchmark.) I considered that the benchmark might be wrong, so I disassembled the 2017-Zen executable and found the following assembly for the Double-Precision 256-bit FMA3 benchmark:
0000000000005130 <Flops::f64v2_FMA_FMA3_c12x4::run_kernel(unsigned long) const>:
5130: f3 0f 1e fa endbr64
5134: c5 d0 57 ed vxorps xmm5,xmm5,xmm5
5138: 48 89 f1 mov rcx,rsi
513b: 0f 31 rdtsc
513d: 0f b6 c0 movzx eax,al
5140: c4 e1 d3 2a d0 vcvtsi2sd xmm2,xmm5,rax
5145: c4 e2 7d 19 d2 vbroadcastsd ymm2,xmm2
514a: 0f 31 rdtsc
514c: 0f b6 c0 movzx eax,al
514f: c4 e1 d3 2a e0 vcvtsi2sd xmm4,xmm5,rax
5154: c4 e2 7d 19 e4 vbroadcastsd ymm4,xmm4
5159: 0f 31 rdtsc
515b: 0f b6 c0 movzx eax,al
515e: c4 e1 d3 2a d8 vcvtsi2sd xmm3,xmm5,rax
5163: c4 e2 7d 19 db vbroadcastsd ymm3,xmm3
5168: 0f 31 rdtsc
516a: 0f b6 c0 movzx eax,al
516d: c4 61 d3 2a c0 vcvtsi2sd xmm8,xmm5,rax
5172: c4 42 7d 19 c0 vbroadcastsd ymm8,xmm8
5177: 0f 31 rdtsc
5179: 0f b6 c0 movzx eax,al
517c: c4 e1 d3 2a f8 vcvtsi2sd xmm7,xmm5,rax
5181: c4 e2 7d 19 ff vbroadcastsd ymm7,xmm7
5186: 0f 31 rdtsc
5188: 0f b6 c0 movzx eax,al
518b: c4 e1 d3 2a f0 vcvtsi2sd xmm6,xmm5,rax
5190: c4 e2 7d 19 f6 vbroadcastsd ymm6,xmm6
5195: 0f 31 rdtsc
5197: 0f b6 c0 movzx eax,al
519a: c4 61 d3 2a e8 vcvtsi2sd xmm13,xmm5,rax
519f: c4 42 7d 19 ed vbroadcastsd ymm13,xmm13
51a4: 0f 31 rdtsc
51a6: 0f b6 c0 movzx eax,al
51a9: c4 61 d3 2a d8 vcvtsi2sd xmm11,xmm5,rax
51ae: c4 42 7d 19 db vbroadcastsd ymm11,xmm11
51b3: 0f 31 rdtsc
51b5: 0f b6 c0 movzx eax,al
51b8: c4 61 d3 2a c8 vcvtsi2sd xmm9,xmm5,rax
51bd: c4 42 7d 19 c9 vbroadcastsd ymm9,xmm9
51c2: 0f 31 rdtsc
51c4: 0f b6 c0 movzx eax,al
51c7: c4 61 d3 2a e0 vcvtsi2sd xmm12,xmm5,rax
51cc: c4 42 7d 19 e4 vbroadcastsd ymm12,xmm12
51d1: 0f 31 rdtsc
51d3: 0f b6 c0 movzx eax,al
51d6: c4 61 d3 2a d0 vcvtsi2sd xmm10,xmm5,rax
51db: c4 42 7d 19 d2 vbroadcastsd ymm10,xmm10
51e0: 0f 31 rdtsc
51e2: c5 fd 28 0d 76 15 00 00 vmovapd ymm1,YMMWORD PTR [rip+0x1576] # 6760 <typeinfo name for Flops::f64v2_FMA_FMA3_c12x4+0xb0>
51ea: c5 fd 28 05 2e 16 00 00 vmovapd ymm0,YMMWORD PTR [rip+0x162e] # 6820 <typeinfo name for Flops::f64v2_FMA_FMA3_c12x4+0x170>
51f2: 0f b6 c0 movzx eax,al
51f5: c4 e1 d3 2a e8 vcvtsi2sd xmm5,xmm5,rax
51fa: c4 e2 7d 19 ed vbroadcastsd ymm5,xmm5
51ff: 90 nop
5200: c4 e2 f5 b8 d0 vfmadd231pd ymm2,ymm1,ymm0
5205: c4 e2 f5 b8 e0 vfmadd231pd ymm4,ymm1,ymm0
520a: 48 ff c9 dec rcx
520d: c4 e2 f5 b8 d8 vfmadd231pd ymm3,ymm1,ymm0
5212: c4 62 f5 b8 c0 vfmadd231pd ymm8,ymm1,ymm0
5217: c4 e2 f5 b8 f8 vfmadd231pd ymm7,ymm1,ymm0
521c: c4 e2 f5 b8 f0 vfmadd231pd ymm6,ymm1,ymm0
5221: c4 62 f5 b8 e8 vfmadd231pd ymm13,ymm1,ymm0
5226: c4 62 f5 b8 d8 vfmadd231pd ymm11,ymm1,ymm0
522b: c4 62 f5 b8 c8 vfmadd231pd ymm9,ymm1,ymm0
5230: c4 62 f5 b8 e0 vfmadd231pd ymm12,ymm1,ymm0
5235: c4 62 f5 b8 d0 vfmadd231pd ymm10,ymm1,ymm0
523a: c4 e2 f5 b8 e8 vfmadd231pd ymm5,ymm1,ymm0
523f: c4 e2 f5 bc d0 vfnmadd231pd ymm2,ymm1,ymm0
5244: c4 e2 f5 bc e0 vfnmadd231pd ymm4,ymm1,ymm0
5249: c4 e2 f5 bc d8 vfnmadd231pd ymm3,ymm1,ymm0
524e: c4 62 f5 bc c0 vfnmadd231pd ymm8,ymm1,ymm0
5253: c4 e2 f5 bc f8 vfnmadd231pd ymm7,ymm1,ymm0
5258: c4 e2 f5 bc f0 vfnmadd231pd ymm6,ymm1,ymm0
525d: c4 62 f5 bc e8 vfnmadd231pd ymm13,ymm1,ymm0
5262: c4 62 f5 bc d8 vfnmadd231pd ymm11,ymm1,ymm0
5267: c4 62 f5 bc c8 vfnmadd231pd ymm9,ymm1,ymm0
526c: c4 62 f5 bc e0 vfnmadd231pd ymm12,ymm1,ymm0
5271: c4 62 f5 bc d0 vfnmadd231pd ymm10,ymm1,ymm0
5276: c4 e2 f5 bc e8 vfnmadd231pd ymm5,ymm1,ymm0
527b: c4 e2 f5 b8 d0 vfmadd231pd ymm2,ymm1,ymm0
5280: c4 e2 f5 b8 e0 vfmadd231pd ymm4,ymm1,ymm0
5285: c4 e2 f5 b8 d8 vfmadd231pd ymm3,ymm1,ymm0
528a: c4 62 f5 b8 c0 vfmadd231pd ymm8,ymm1,ymm0
528f: c4 e2 f5 b8 f8 vfmadd231pd ymm7,ymm1,ymm0
5294: c4 e2 f5 b8 f0 vfmadd231pd ymm6,ymm1,ymm0
5299: c4 62 f5 b8 e8 vfmadd231pd ymm13,ymm1,ymm0
529e: c4 62 f5 b8 d8 vfmadd231pd ymm11,ymm1,ymm0
52a3: c4 62 f5 b8 c8 vfmadd231pd ymm9,ymm1,ymm0
52a8: c4 62 f5 b8 e0 vfmadd231pd ymm12,ymm1,ymm0
52ad: c4 62 f5 b8 d0 vfmadd231pd ymm10,ymm1,ymm0
52b2: c4 e2 f5 b8 e8 vfmadd231pd ymm5,ymm1,ymm0
52b7: c4 e2 f5 bc d0 vfnmadd231pd ymm2,ymm1,ymm0
52bc: c4 e2 f5 bc e0 vfnmadd231pd ymm4,ymm1,ymm0
52c1: c4 e2 f5 bc d8 vfnmadd231pd ymm3,ymm1,ymm0
52c6: c4 62 f5 bc c0 vfnmadd231pd ymm8,ymm1,ymm0
52cb: c4 e2 f5 bc f8 vfnmadd231pd ymm7,ymm1,ymm0
52d0: c4 e2 f5 bc f0 vfnmadd231pd ymm6,ymm1,ymm0
52d5: c4 62 f5 bc e8 vfnmadd231pd ymm13,ymm1,ymm0
52da: c4 62 f5 bc d8 vfnmadd231pd ymm11,ymm1,ymm0
52df: c4 62 f5 bc c8 vfnmadd231pd ymm9,ymm1,ymm0
52e4: c4 62 f5 bc e0 vfnmadd231pd ymm12,ymm1,ymm0
52e9: c4 62 f5 bc d0 vfnmadd231pd ymm10,ymm1,ymm0
52ee: c4 e2 f5 bc e8 vfnmadd231pd ymm5,ymm1,ymm0
52f3: 0f 85 07 ff ff ff jne 5200 <Flops::f64v2_FMA_FMA3_c12x4::run_kernel(unsigned long) const+0xd0>
52f9: c4 c1 6d 58 c5 vaddpd ymm0,ymm2,ymm13
52fe: c4 41 3d 58 c4 vaddpd ymm8,ymm8,ymm12
5303: c4 c1 5d 58 e3 vaddpd ymm4,ymm4,ymm11
5308: c4 c1 45 58 fa vaddpd ymm7,ymm7,ymm10
530d: c4 c1 65 58 d9 vaddpd ymm3,ymm3,ymm9
5312: c5 cd 58 f5 vaddpd ymm6,ymm6,ymm5
5316: c4 c1 7d 58 c0 vaddpd ymm0,ymm0,ymm8
531b: c5 dd 58 e7 vaddpd ymm4,ymm4,ymm7
531f: c5 e5 58 de vaddpd ymm3,ymm3,ymm6
5323: c5 fd 58 c4 vaddpd ymm0,ymm0,ymm4
5327: c5 fd 58 c3 vaddpd ymm0,ymm0,ymm3
532b: c4 e3 7d 19 c1 01 vextractf128 xmm1,ymm0,0x1
5331: c5 f1 58 c0 vaddpd xmm0,xmm1,xmm0
5335: c5 f9 15 c8 vunpckhpd xmm1,xmm0,xmm0
5339: c5 f9 58 c1 vaddpd xmm0,xmm0,xmm1
533d: c5 f8 77 vzeroupper
5340: c3 ret
The number of FMA3 instructions in the loop body is indeed 48, giving 48 x 2 x 4 operations per iteration, which matches the code in the file f64v2_FMA_FMA3_c12x4.h in the repo (see the sketch of the GFlops calculation below). Clearly either I am wrong or the benchmark is wrong, but I can't figure out which. If it matters, all code is being run under WSL2 on Windows 10.
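For completeness, this is a minimal sketch of how I understand the benchmark's GFlops figure to be derived from this kernel, assuming it counts 48 FMAs per iteration, 4 doubles per 256-bit vector, and 2 FLOPs per fused multiply-add; the function name and the example numbers are mine, not the benchmark's:

#include <cstdio>

// Hypothetical reconstruction of the GFlops calculation for this kernel.
double estimate_gflops(unsigned long iterations, double seconds) {
    const double fmas_per_iter = 48.0;  // FMA instructions in the loop body
    const double lanes         = 4.0;   // doubles per 256-bit vector
    const double flops_per_fma = 2.0;   // multiply + add counted separately (my assumption)
    const double total_flops   = iterations * fmas_per_iter * lanes * flops_per_fma;
    return total_flops / seconds / 1e9;
}

int main() {
    // Illustrative numbers only: ~2.0e8 iterations in ~1.58 s at ~3.16 GHz
    // would come out to roughly 48 GFlops, matching the reported figure.
    printf("%.2f GFlops\n", estimate_gflops(200000000UL, 1.58));
    return 0;
}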
Upvotes: 1
Views: 109