Reputation: 648
Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz), I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I measure 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I compute the op-count as n*n*n*n*6 (n^3 for one matrix-matrix multiplication, performed on n slices of complex data-points, assuming 6 FLOPs for 1 complex-complex multiplication) and divide it by the average time taken for each run.
Code snippet in main function:
// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
#pragma noinline
do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);
Code snippet: do_timed_run:
void do_timed_run(const std::size_t& n, double& avg_dur)
{
// create the data and lay first touch
auto operand0 = matrix<double>(n, n);
auto operand1 = tensor<double>(n, n, n);
auto result = tensor<double>(n, n, n);
// first touch
#pragma omp parallel
{
set_first_touch(operand1);
set_first_touch(result);
}
// do the experiment
const auto dur1 = omp_get_wtime() * 1E+6;
#pragma omp parallel firstprivate(operand0)
{
#pragma noinline
transp_matrix_tensor_mult(operand0, operand1, result);
}
const auto dur2 = omp_get_wtime() * 1E+6;
avg_dur += dur2 - dur1;
}
Notes:
- I have not included the code of transp_matrix_tensor_mult because I don't think it is relevant.
- #pragma noinline is a debug fixture I'm using to be able to better understand the output of the disassembler.
And now for the disassembly of the function do_timed_run:
0000000000403a20 <_Z12do_timed_runRKmRd>:
403a20: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403a27: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403a2e: 00
403a2f: 48 89 fd mov %rdi,%rbp
403a32: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403a39: 00
403a3a: 48 89 f3 mov %rsi,%rbx
403a3d: 48 89 ee mov %rbp,%rsi
403a40: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403a45: 48 89 ea mov %rbp,%rdx
403a48: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403a4f: 00
403a50: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403a57: 00
403a58: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403a5f: 00
403a60: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403a67: 00
403a68: e8 03 f8 ff ff callq 403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
403a6d: 48 89 ee mov %rbp,%rsi
403a70: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403a75: 48 89 ea mov %rbp,%rdx
403a78: 48 89 e9 mov %rbp,%rcx
403a7b: e8 80 f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a80: 48 89 ee mov %rbp,%rsi
403a83: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403a88: 48 89 ea mov %rbp,%rdx
403a8b: 48 89 e9 mov %rbp,%rcx
403a8e: e8 6d f8 ff ff callq 403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
403a93: bf 88 f3 44 00 mov $0x44f388,%edi
403a98: e8 53 f7 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403a9d: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403aa4: bf c0 f3 44 00 mov $0x44f3c0,%edi
403aa9: 33 c0 xor %eax,%eax
403aab: e8 20 f6 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403ab0: 85 c0 test %eax,%eax
403ab2: 74 21 je 403ad5 <_Z12do_timed_runRKmRd+0xb5>
403ab4: ba a5 3c 40 00 mov $0x403ca5,%edx
403ab9: bf c0 f3 44 00 mov $0x44f3c0,%edi
403abe: be 02 00 00 00 mov $0x2,%esi
403ac3: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403ac8: 33 c0 xor %eax,%eax
403aca: 4c 8d 41 38 lea 0x38(%rcx),%r8
403ace: e8 cd f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403ad3: eb 41 jmp 403b16 <_Z12do_timed_runRKmRd+0xf6>
403ad5: bf c0 f3 44 00 mov $0x44f3c0,%edi
403ada: 33 c0 xor %eax,%eax
403adc: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403ae3: e8 58 f7 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403ae8: be 9c 13 47 00 mov $0x47139c,%esi
403aed: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403af4: 00
403af5: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403afa: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403afe: e8 a2 01 00 00 callq 403ca5 <_Z12do_timed_runRKmRd+0x285>
403b03: bf c0 f3 44 00 mov $0x44f3c0,%edi
403b08: 33 c0 xor %eax,%eax
403b0a: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b11: e8 aa f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b16: e8 85 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b1b: c5 fb 11 04 24 vmovsd %xmm0,(%rsp)
403b20: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b25: 33 c0 xor %eax,%eax
403b27: e8 a4 f5 ff ff callq 4030d0 <__kmpc_ok_to_fork@plt>
403b2c: 85 c0 test %eax,%eax
403b2e: 74 25 je 403b55 <_Z12do_timed_runRKmRd+0x135>
403b30: ba 0b 3c 40 00 mov $0x403c0b,%edx
403b35: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b3a: be 03 00 00 00 mov $0x3,%esi
403b3f: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
403b44: 33 c0 xor %eax,%eax
403b46: 4c 8d 41 38 lea 0x38(%rcx),%r8
403b4a: 4c 8d 49 70 lea 0x70(%rcx),%r9
403b4e: e8 4d f5 ff ff callq 4030a0 <__kmpc_fork_call@plt>
403b53: eb 45 jmp 403b9a <_Z12do_timed_runRKmRd+0x17a>
403b55: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b5a: 33 c0 xor %eax,%eax
403b5c: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b63: e8 d8 f6 ff ff callq 403240 <__kmpc_serialized_parallel@plt>
403b68: be a0 13 47 00 mov $0x4713a0,%esi
403b6d: 48 8d bc 24 d0 00 00 lea 0xd0(%rsp),%rdi
403b74: 00
403b75: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
403b7a: 48 8d 4a 38 lea 0x38(%rdx),%rcx
403b7e: 4c 8d 42 70 lea 0x70(%rdx),%r8
403b82: e8 84 00 00 00 callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
403b87: bf f8 f3 44 00 mov $0x44f3f8,%edi
403b8c: 33 c0 xor %eax,%eax
403b8e: 8b b4 24 d0 00 00 00 mov 0xd0(%rsp),%esi
403b95: e8 26 f7 ff ff callq 4032c0 <__kmpc_end_serialized_parallel@plt>
403b9a: e8 01 f6 ff ff callq 4031a0 <omp_get_wtime@plt>
403b9f: c5 fb 5c 0c 24 vsubsd (%rsp),%xmm0,%xmm1
403ba4: c5 fb 10 05 cc c4 01 vmovsd 0x1c4cc(%rip),%xmm0 # 420078 <alpha_beta.61562.0.0.28+0x28>
403bab: 00
403bac: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403bb1: c4 e2 f9 a9 0b vfmadd213sd (%rbx),%xmm0,%xmm1
403bb6: c5 fb 11 0b vmovsd %xmm1,(%rbx)
403bba: e8 71 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bbf: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403bc4: e8 67 f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bc9: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403bce: e8 5d f5 ff ff callq 403130 <_ZN5s3dft9data_packIdED1Ev@plt>
403bd3: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403bda: 00
403bdb: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403be2: 00
403be3: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403bea: 00
403beb: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403bf2: 00
403bf3: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403bfa: 00
403bfb: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c02: 00
403c03: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403c0a: c3 retq
403c0b: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403c12: 4c 89 c6 mov %r8,%rsi
403c15: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403c1c: 00
403c1d: 4c 8d 24 24 lea (%rsp),%r12
403c21: 4c 89 e7 mov %r12,%rdi
403c24: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403c2b: 00
403c2c: 48 89 cd mov %rcx,%rbp
403c2f: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403c36: 00
403c37: 48 89 d3 mov %rdx,%rbx
403c3a: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403c41: 00
403c42: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403c49: 00
403c4a: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403c51: 00
403c52: e8 49 03 00 00 callq 403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
403c57: 4c 89 e7 mov %r12,%rdi
403c5a: 48 89 de mov %rbx,%rsi
403c5d: 48 89 ea mov %rbp,%rdx
403c60: e8 8b 01 00 00 callq 403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
403c65: 4c 89 e7 mov %r12,%rdi
403c68: e8 63 01 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev> # <--- ...and here it ends
403c6d: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403c74: 00
403c75: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403c7c: 00
403c7d: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403c84: 00
403c85: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403c8c: 00
403c8d: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403c94: 00
403c95: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403c9c: 00
403c9d: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403ca4: c3 retq
403ca5: 48 81 ec d8 00 00 00 sub $0xd8,%rsp
403cac: 48 89 d7 mov %rdx,%rdi
403caf: 48 89 ac 24 c8 00 00 mov %rbp,0xc8(%rsp)
403cb6: 00
403cb7: 48 89 9c 24 c0 00 00 mov %rbx,0xc0(%rsp)
403cbe: 00
403cbf: 48 89 cb mov %rcx,%rbx
403cc2: 4c 89 bc 24 a0 00 00 mov %r15,0xa0(%rsp)
403cc9: 00
403cca: 4c 89 b4 24 a8 00 00 mov %r14,0xa8(%rsp)
403cd1: 00
403cd2: 4c 89 ac 24 b0 00 00 mov %r13,0xb0(%rsp)
403cd9: 00
403cda: 4c 89 a4 24 b8 00 00 mov %r12,0xb8(%rsp)
403ce1: 00
403ce2: e8 99 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
403ce7: 48 89 df mov %rbx,%rdi
403cea: e8 91 f4 ff ff callq 403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
403cef: 4c 8b bc 24 a0 00 00 mov 0xa0(%rsp),%r15
403cf6: 00
403cf7: 4c 8b b4 24 a8 00 00 mov 0xa8(%rsp),%r14
403cfe: 00
403cff: 4c 8b ac 24 b0 00 00 mov 0xb0(%rsp),%r13
403d06: 00
403d07: 4c 8b a4 24 b8 00 00 mov 0xb8(%rsp),%r12
403d0e: 00
403d0f: 48 8b 9c 24 c0 00 00 mov 0xc0(%rsp),%rbx
403d16: 00
403d17: 48 8b ac 24 c8 00 00 mov 0xc8(%rsp),%rbp
403d1e: 00
403d1f: 48 81 c4 d8 00 00 00 add $0xd8,%rsp
403d26: c3 retq
403d27: 48 89 04 24 mov %rax,(%rsp)
403d2b: bf 30 f4 44 00 mov $0x44f430,%edi
403d30: e8 bb f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d35: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d3c: 48 8d 7c 24 40 lea 0x40(%rsp),%rdi
403d41: e8 9a 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d46: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
403d4b: e8 90 00 00 00 callq 403de0 <_ZN5s3dft6tensorIdED1Ev>
403d50: 48 8d 7c 24 78 lea 0x78(%rsp),%rdi
403d55: e8 76 00 00 00 callq 403dd0 <_ZN5s3dft6matrixIdED1Ev>
403d5a: 48 8b 3c 24 mov (%rsp),%rdi
403d5e: e8 5d f3 ff ff callq 4030c0 <_Unwind_Resume@plt>
403d63: 48 89 04 24 mov %rax,(%rsp)
403d67: bf 68 f4 44 00 mov $0x44f468,%edi
403d6c: e8 7f f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d71: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d78: eb cc jmp 403d46 <_Z12do_timed_runRKmRd+0x326>
403d7a: 48 89 04 24 mov %rax,(%rsp)
403d7e: bf a0 f4 44 00 mov $0x44f4a0,%edi
403d83: e8 68 f4 ff ff callq 4031f0 <__kmpc_global_thread_num@plt>
403d88: 89 84 24 d0 00 00 00 mov %eax,0xd0(%rsp)
403d8f: eb bf jmp 403d50 <_Z12do_timed_runRKmRd+0x330>
403d91: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
403d98: 00
403d99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
Primary questions:
- What is wrong with my benchmark that it measures an impossible 127GFLOP/s?
Secondary questions:
- Why are there non-conditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Please consider that I have only provided the information which I think is relevant. Additional information will be gladly provided upon request. Thank you in advance for your time.
Edit:
@PeterCordes I did build with debug symbols enabled. The assembly posted above was obtained using objdump, which somehow did not retrieve the required symbols. Here's (a snippet of) the assembly emitted directly by icpc:
# omp_get_wtime()
call omp_get_wtime #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
# LOE rbx xmm0
..B4.12: # Preds ..B4.11
# Execution count [1.00e+00]
..LN420:
vmovsd %xmm0, (%rsp) #122.23[spill]
..LN421:
# LOE rbx
..B4.13: # Preds ..B4.12
# Execution count [1.00e+00]
..LN422:
.loc 1 123 is_stmt 1
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN423:
xorl %eax, %eax #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
call __kmpc_ok_to_fork #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
# LOE rbx eax
..B4.14: # Preds ..B4.13
# Execution count [1.00e+00]
..LN426:
testl %eax, %eax #123.5
..LN427:
je ..B4.17 # Prob 50% #123.5
..LN428:
# LOE rbx
..B4.15: # Preds ..B4.14
# Execution count [0.00e+00]
..LN429:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN430:
xorl %edx, %edx #123.5
..LN431:
incq %rdx #123.5
..LN432:
xorl %eax, %eax #123.5
..LN433:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
call __kmpc_push_num_threads #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
# LOE rbx
..B4.16: # Preds ..B4.15
# Execution count [0.00e+00]
..LN436:
movl $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN438:
movl $3, %esi #123.5
..LN439:
lea 8(%rsp), %rcx #123.5
..LN440:
xorl %eax, %eax #123.5
..LN441:
lea 56(%rcx), %r8 #123.5
..LN442:
lea 112(%rcx), %r9 #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
call __kmpc_fork_call #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
jmp ..B4.20 # Prob 100% #123.5
..LN445:
# LOE rbx
..B4.17: # Preds ..B4.14
# Execution count [0.00e+00]
..LN446:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN447:
xorl %eax, %eax #123.5
..LN448:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
call __kmpc_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
# LOE rbx
..B4.18: # Preds ..B4.17
# Execution count [0.00e+00]
..LN451:
movl $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi #123.5
..LN452:
lea 208(%rsp), %rdi #123.5
..LN453:
lea 8(%rsp), %rdx #123.5
..LN454:
lea 56(%rdx), %rcx #123.5
..LN455:
lea 112(%rdx), %r8 #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
call L__Z12do_timed_runRKmRd_123__par_region1_2.5 #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
# LOE rbx
..B4.19: # Preds ..B4.18
# Execution count [0.00e+00]
..LN458:
movl $.2.40_2_kmpc_loc_struct_pack.65, %edi #123.5
..LN459:
xorl %eax, %eax #123.5
..LN460:
movl 208(%rsp), %esi #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
call __kmpc_end_serialized_parallel #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
# LOE rbx
..B4.20: # Preds ..B4.16 ..B4.19
# Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
.loc 1 128 is_stmt 1
# omp_get_wtime()
call omp_get_wtime #128.23
As you can see, the output is very verbose and harder to read.
Upvotes: 2
Views: 98
Reputation: 364987
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock: 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).
There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp
should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.
As far as benchmark validity goes, a good litmus test is whether the time scales reasonably with problem size. Depending on whether or not a smaller or larger problem fits in L3 cache, the time should change in some reasonable way. If it doesn't, you'd worry about the work being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of using a jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it; it then jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Duplicate ret comes from the "tail duplication" optimization, where multiple paths of execution that all return can each get their own ret instead of jumping to a shared ret (along with copies of any cleanup necessary, like restoring registers and the stack pointer).
Upvotes: 3