Reputation: 140
For an assignment, I am trying to calculate the theoretical maximum achievable GFLOPS of a single core of my processor, an AMD Ryzen 9 5900HS. According to Agner Fog's instruction tables for Zen 3 (the R9 5900HS is a Zen 3 processor), the reciprocal throughput of a vfmadd132pd
instruction is 0.5, giving a maximum of 2 FMA3 instructions per clock. Since each AVX2 vector holds 4 doubles, that gives a theoretical maximum of 8 FLOPS/clock. With my core's average frequency of 3.16 GHz, that works out to a theoretical maximum of roughly 25 GFLOPS. All good so far.
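To make the arithmetic explicit, here is the calculation I am doing, written out as a small standalone C++ snippet; the constants are just my assumed values from above, not anything measured by the benchmark:

#include <cstdio>

int main() {
    // Assumed per-core numbers (from Agner Fog's tables and my observed clock):
    const double fma_per_clock      = 2.0;   // reciprocal throughput 0.5 -> 2 FMA3/clock
    const double doubles_per_vector = 4.0;   // 256-bit AVX2 vector of doubles
    const double freq_ghz           = 3.16;  // average core frequency I observed

    // My current reasoning: FLOPS/clock = FMAs/clock * vector lanes
    const double flops_per_clock = fma_per_clock * doubles_per_vector; // = 8
    const double peak_gflops     = flops_per_clock * freq_ghz;         // ~= 25.3

    printf("FLOPS/clock = %.0f, peak = %.2f GFLOPS\n", flops_per_clock, peak_gflops);
    return 0;
}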
Now, the issue arises when I run the Flops benchmark. Running the recommended 2017-Zen binary (both the precompiled binary present in the repo and one compiled locally) gives me the following results:
Running Zen tuned binary with 1 thread...
Single-Precision - 128-bit AVX - Add/Sub
GFlops = 22.528
Result = 2.86329e+06
Double-Precision - 128-bit AVX - Add/Sub
GFlops = 11.296
Result = 1.43905e+06
Single-Precision - 128-bit AVX - Multiply
GFlops = 25.536
Result = 3.26002e+06
Double-Precision - 128-bit AVX - Multiply
GFlops = 12.768
Result = 1.63224e+06
Single-Precision - 128-bit AVX - Multiply + Add
GFlops = 33.936
Result = 3.62228e+06
Double-Precision - 128-bit AVX - Multiply + Add
GFlops = 16.992
Result = 1.80521e+06
Single-Precision - 128-bit FMA3 - Fused Multiply Add
GFlops = 48.48
Result = 3.09552e+06
Double-Precision - 128-bit FMA3 - Fused Multiply Add
GFlops = 24.24
Result = 1.53887e+06
Single-Precision - 256-bit AVX - Add/Sub
GFlops = 45.12
Result = 5.74426e+06
Double-Precision - 256-bit AVX - Add/Sub
GFlops = 22.592
Result = 2.88426e+06
Single-Precision - 256-bit AVX - Multiply
GFlops = 50.976
Result = 6.46382e+06
Double-Precision - 256-bit AVX - Multiply
GFlops = 25.488
Result = 3.2681e+06
Single-Precision - 256-bit AVX - Multiply + Add
GFlops = 67.872
Result = 7.20537e+06
Double-Precision - 256-bit AVX - Multiply + Add
GFlops = 32.88
Result = 3.48666e+06
Single-Precision - 256-bit FMA3 - Fused Multiply Add
GFlops = 96.768
Result = 6.14762e+06
Double-Precision - 256-bit FMA3 - Fused Multiply Add
GFlops = 48.48
Result = 3.07488e+06
As you can see, the GFlops reported for the Double-Precision 256-bit FMA3 benchmark is 48.48, almost double the theoretical maximum I calculated for my core. (Note that I monitored the CPU frequency in Task Manager, and it stayed at 3.16 GHz, occasionally ticking up to 3.17 GHz, throughout the benchmark.) I considered that the benchmark might be wrong, so I disassembled the 2017-Zen executable and found the following assembly for the Double-Precision 256-bit FMA3 benchmark:
0000000000005130 <Flops::f64v2_FMA_FMA3_c12x4::run_kernel(unsigned long) const>:
5130: f3 0f 1e fa endbr64
5134: c5 d0 57 ed vxorps xmm5,xmm5,xmm5
5138: 48 89 f1 mov rcx,rsi
513b: 0f 31 rdtsc
513d: 0f b6 c0 movzx eax,al
5140: c4 e1 d3 2a d0 vcvtsi2sd xmm2,xmm5,rax
5145: c4 e2 7d 19 d2 vbroadcastsd ymm2,xmm2
514a: 0f 31 rdtsc
514c: 0f b6 c0 movzx eax,al
514f: c4 e1 d3 2a e0 vcvtsi2sd xmm4,xmm5,rax
5154: c4 e2 7d 19 e4 vbroadcastsd ymm4,xmm4
5159: 0f 31 rdtsc
515b: 0f b6 c0 movzx eax,al
515e: c4 e1 d3 2a d8 vcvtsi2sd xmm3,xmm5,rax
5163: c4 e2 7d 19 db vbroadcastsd ymm3,xmm3
5168: 0f 31 rdtsc
516a: 0f b6 c0 movzx eax,al
516d: c4 61 d3 2a c0 vcvtsi2sd xmm8,xmm5,rax
5172: c4 42 7d 19 c0 vbroadcastsd ymm8,xmm8
5177: 0f 31 rdtsc
5179: 0f b6 c0 movzx eax,al
517c: c4 e1 d3 2a f8 vcvtsi2sd xmm7,xmm5,rax
5181: c4 e2 7d 19 ff vbroadcastsd ymm7,xmm7
5186: 0f 31 rdtsc
5188: 0f b6 c0 movzx eax,al
518b: c4 e1 d3 2a f0 vcvtsi2sd xmm6,xmm5,rax
5190: c4 e2 7d 19 f6 vbroadcastsd ymm6,xmm6
5195: 0f 31 rdtsc
5197: 0f b6 c0 movzx eax,al
519a: c4 61 d3 2a e8 vcvtsi2sd xmm13,xmm5,rax
519f: c4 42 7d 19 ed vbroadcastsd ymm13,xmm13
51a4: 0f 31 rdtsc
51a6: 0f b6 c0 movzx eax,al
51a9: c4 61 d3 2a d8 vcvtsi2sd xmm11,xmm5,rax
51ae: c4 42 7d 19 db vbroadcastsd ymm11,xmm11
51b3: 0f 31 rdtsc
51b5: 0f b6 c0 movzx eax,al
51b8: c4 61 d3 2a c8 vcvtsi2sd xmm9,xmm5,rax
51bd: c4 42 7d 19 c9 vbroadcastsd ymm9,xmm9
51c2: 0f 31 rdtsc
51c4: 0f b6 c0 movzx eax,al
51c7: c4 61 d3 2a e0 vcvtsi2sd xmm12,xmm5,rax
51cc: c4 42 7d 19 e4 vbroadcastsd ymm12,xmm12
51d1: 0f 31 rdtsc
51d3: 0f b6 c0 movzx eax,al
51d6: c4 61 d3 2a d0 vcvtsi2sd xmm10,xmm5,rax
51db: c4 42 7d 19 d2 vbroadcastsd ymm10,xmm10
51e0: 0f 31 rdtsc
51e2: c5 fd 28 0d 76 15 00 00 vmovapd ymm1,YMMWORD PTR [rip+0x1576] # 6760 <typeinfo name for Flops::f64v2_FMA_FMA3_c12x4+0xb0>
51ea: c5 fd 28 05 2e 16 00 00 vmovapd ymm0,YMMWORD PTR [rip+0x162e] # 6820 <typeinfo name for Flops::f64v2_FMA_FMA3_c12x4+0x170>
51f2: 0f b6 c0 movzx eax,al
51f5: c4 e1 d3 2a e8 vcvtsi2sd xmm5,xmm5,rax
51fa: c4 e2 7d 19 ed vbroadcastsd ymm5,xmm5
51ff: 90 nop
5200: c4 e2 f5 b8 d0 vfmadd231pd ymm2,ymm1,ymm0
5205: c4 e2 f5 b8 e0 vfmadd231pd ymm4,ymm1,ymm0
520a: 48 ff c9 dec rcx
520d: c4 e2 f5 b8 d8 vfmadd231pd ymm3,ymm1,ymm0
5212: c4 62 f5 b8 c0 vfmadd231pd ymm8,ymm1,ymm0
5217: c4 e2 f5 b8 f8 vfmadd231pd ymm7,ymm1,ymm0
521c: c4 e2 f5 b8 f0 vfmadd231pd ymm6,ymm1,ymm0
5221: c4 62 f5 b8 e8 vfmadd231pd ymm13,ymm1,ymm0
5226: c4 62 f5 b8 d8 vfmadd231pd ymm11,ymm1,ymm0
522b: c4 62 f5 b8 c8 vfmadd231pd ymm9,ymm1,ymm0
5230: c4 62 f5 b8 e0 vfmadd231pd ymm12,ymm1,ymm0
5235: c4 62 f5 b8 d0 vfmadd231pd ymm10,ymm1,ymm0
523a: c4 e2 f5 b8 e8 vfmadd231pd ymm5,ymm1,ymm0
523f: c4 e2 f5 bc d0 vfnmadd231pd ymm2,ymm1,ymm0
5244: c4 e2 f5 bc e0 vfnmadd231pd ymm4,ymm1,ymm0
5249: c4 e2 f5 bc d8 vfnmadd231pd ymm3,ymm1,ymm0
524e: c4 62 f5 bc c0 vfnmadd231pd ymm8,ymm1,ymm0
5253: c4 e2 f5 bc f8 vfnmadd231pd ymm7,ymm1,ymm0
5258: c4 e2 f5 bc f0 vfnmadd231pd ymm6,ymm1,ymm0
525d: c4 62 f5 bc e8 vfnmadd231pd ymm13,ymm1,ymm0
5262: c4 62 f5 bc d8 vfnmadd231pd ymm11,ymm1,ymm0
5267: c4 62 f5 bc c8 vfnmadd231pd ymm9,ymm1,ymm0
526c: c4 62 f5 bc e0 vfnmadd231pd ymm12,ymm1,ymm0
5271: c4 62 f5 bc d0 vfnmadd231pd ymm10,ymm1,ymm0
5276: c4 e2 f5 bc e8 vfnmadd231pd ymm5,ymm1,ymm0
527b: c4 e2 f5 b8 d0 vfmadd231pd ymm2,ymm1,ymm0
5280: c4 e2 f5 b8 e0 vfmadd231pd ymm4,ymm1,ymm0
5285: c4 e2 f5 b8 d8 vfmadd231pd ymm3,ymm1,ymm0
528a: c4 62 f5 b8 c0 vfmadd231pd ymm8,ymm1,ymm0
528f: c4 e2 f5 b8 f8 vfmadd231pd ymm7,ymm1,ymm0
5294: c4 e2 f5 b8 f0 vfmadd231pd ymm6,ymm1,ymm0
5299: c4 62 f5 b8 e8 vfmadd231pd ymm13,ymm1,ymm0
529e: c4 62 f5 b8 d8 vfmadd231pd ymm11,ymm1,ymm0
52a3: c4 62 f5 b8 c8 vfmadd231pd ymm9,ymm1,ymm0
52a8: c4 62 f5 b8 e0 vfmadd231pd ymm12,ymm1,ymm0
52ad: c4 62 f5 b8 d0 vfmadd231pd ymm10,ymm1,ymm0
52b2: c4 e2 f5 b8 e8 vfmadd231pd ymm5,ymm1,ymm0
52b7: c4 e2 f5 bc d0 vfnmadd231pd ymm2,ymm1,ymm0
52bc: c4 e2 f5 bc e0 vfnmadd231pd ymm4,ymm1,ymm0
52c1: c4 e2 f5 bc d8 vfnmadd231pd ymm3,ymm1,ymm0
52c6: c4 62 f5 bc c0 vfnmadd231pd ymm8,ymm1,ymm0
52cb: c4 e2 f5 bc f8 vfnmadd231pd ymm7,ymm1,ymm0
52d0: c4 e2 f5 bc f0 vfnmadd231pd ymm6,ymm1,ymm0
52d5: c4 62 f5 bc e8 vfnmadd231pd ymm13,ymm1,ymm0
52da: c4 62 f5 bc d8 vfnmadd231pd ymm11,ymm1,ymm0
52df: c4 62 f5 bc c8 vfnmadd231pd ymm9,ymm1,ymm0
52e4: c4 62 f5 bc e0 vfnmadd231pd ymm12,ymm1,ymm0
52e9: c4 62 f5 bc d0 vfnmadd231pd ymm10,ymm1,ymm0
52ee: c4 e2 f5 bc e8 vfnmadd231pd ymm5,ymm1,ymm0
52f3: 0f 85 07 ff ff ff jne 5200 <Flops::f64v2_FMA_FMA3_c12x4::run_kernel(unsigned long) const+0xd0>
52f9: c4 c1 6d 58 c5 vaddpd ymm0,ymm2,ymm13
52fe: c4 41 3d 58 c4 vaddpd ymm8,ymm8,ymm12
5303: c4 c1 5d 58 e3 vaddpd ymm4,ymm4,ymm11
5308: c4 c1 45 58 fa vaddpd ymm7,ymm7,ymm10
530d: c4 c1 65 58 d9 vaddpd ymm3,ymm3,ymm9
5312: c5 cd 58 f5 vaddpd ymm6,ymm6,ymm5
5316: c4 c1 7d 58 c0 vaddpd ymm0,ymm0,ymm8
531b: c5 dd 58 e7 vaddpd ymm4,ymm4,ymm7
531f: c5 e5 58 de vaddpd ymm3,ymm3,ymm6
5323: c5 fd 58 c4 vaddpd ymm0,ymm0,ymm4
5327: c5 fd 58 c3 vaddpd ymm0,ymm0,ymm3
532b: c4 e3 7d 19 c1 01 vextractf128 xmm1,ymm0,0x1
5331: c5 f1 58 c0 vaddpd xmm0,xmm1,xmm0
5335: c5 f9 15 c8 vunpckhpd xmm1,xmm0,xmm0
5339: c5 f9 58 c1 vaddpd xmm0,xmm0,xmm1
533d: c5 f8 77 vzeroupper
5340: c3 ret
The number of FMA3 instructions in the loop body is indeed 48, giving 48 x 2 x 4 operations per iteration, which matches the code in the file f64v2_FMA_FMA3_c12x4.h in the repo (see the sketch of the GFlops calculation below). Clearly either I am wrong or the benchmark is wrong, but I can't figure out which. If it matters, all code is being run under WSL2 on Windows 10.
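For completeness, this is a minimal sketch of how I understand the benchmark's GFlops figure to be derived from this kernel, assuming it counts 48 FMAs per iteration, 4 doubles per 256-bit vector, and 2 FLOPs per fused multiply-add; the function name and the example numbers are mine, not the benchmark's:

#include <cstdio>

// Hypothetical reconstruction of the GFlops calculation for this kernel.
double estimate_gflops(unsigned long iterations, double seconds) {
    const double fmas_per_iter = 48.0;  // FMA instructions in the loop body
    const double lanes         = 4.0;   // doubles per 256-bit vector
    const double flops_per_fma = 2.0;   // multiply + add counted separately (my assumption)
    const double total_flops   = iterations * fmas_per_iter * lanes * flops_per_fma;
    return total_flops / seconds / 1e9;
}

int main() {
    // Illustrative numbers only: ~2.0e8 iterations in ~1.58 s at ~3.16 GHz
    // would come out to roughly 48 GFlops, matching the reported figure.
    printf("%.2f GFlops\n", estimate_gflops(200000000UL, 1.58));
    return 0;
}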
Upvotes: 1
Views: 109