Reputation: 620
One can see from this tutorial on the usage of Intel MKL DFTs that Dr. Andrey E. Vladimirov uses the time elapsed during a task, namely t1-t0, to compute the number of GigaFLOPS using GF/s = HztoPerf/(t1-t0), where HztoPerf = 5.0 * 1e-9 * double(fft_size) * log2(double(fft_size)) * double(num_fft).
Is this a general formula? If not, how do I deduce the average GF/s for my CPU (Intel Xeon E5-1660 at 3 GHz with 8 cores) if I know the time elapsed to run a computation (e.g. involving various FFTs)?
Upvotes: 0
Views: 626
Reputation: 363912
You have to know how many FP operations your problem requires. Then you divide that by time.
1e-9 accounts for the Giga = 10^9 metric prefix. Without that, you'd have FLOP/s not GFLOP/s if you divide FLoating point OPeration count by seconds.
5.0 * fft_size * log2(fft_size) appears to be the number of FP ops per FFT.
An efficient FFT is O(n log2(n)), and apparently this implementation has a constant factor of 5. (Or possibly that's including some work done using the result?)
num_fft is presumably the total number of FFTs of that size done, i.e. the repeat count. So the product of all those things is the number of FP ops actually done during computation of the FFT.
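As a minimal sketch of the arithmetic above, the following Python mirrors the tutorial's formula. The function names fft_flop_count and gflops are made up for illustration, the input numbers are not measurements, and the 5 * N * log2(N) count is simply taken from the formula in the question; it is an estimate, not a count of what MKL actually executes.

```python
import math

def fft_flop_count(fft_size: int, num_fft: int) -> float:
    """Estimated FP operations for num_fft transforms of length fft_size.

    Uses the 5 * N * log2(N) estimate from the formula in the question;
    the real operation count of a given FFT library may differ.
    """
    return 5.0 * fft_size * math.log2(fft_size) * num_fft

def gflops(flop_count: float, elapsed_seconds: float) -> float:
    """Convert an operation count and a wall-clock time (t1 - t0) to GFLOP/s."""
    return 1e-9 * flop_count / elapsed_seconds

# Illustrative numbers only: 1000 FFTs of size 1024 finishing in 10 ms.
ops = fft_flop_count(1024, 1000)   # 5 * 1024 * 10 * 1000 FP operations
rate = gflops(ops, 0.010)          # operation count divided by time, scaled by 1e-9
print(f"{ops:.0f} FP ops, {rate:.2f} GFLOP/s")
```

For a run that mixes several FFT sizes, one would sum fft_flop_count over each size and divide the total by the overall elapsed time.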
Hardware performance counters on Intel CPUs can record the number of FLOPs (even counting FMAs as 2): there are events like fp_arith_inst_retired.256b_packed_double for various SIMD widths.
perf has a GFLOPs "metric group" you can use that enables the relevant events and calculates it for you:
perf stat --all-user -M GFLOPs ./my_program my args
Counting only in user-space is probably redundant; kernel code might use SIMD for software RAID5/6, but not in interrupt handlers and probably not system calls. And not FP math.
Example on my i7-6700k Skylake
$ perf stat --all-user -M GFLOPs awk 'BEGIN{for(i=0;i<100000000;i++){}}'

 Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':

                 0      fp_arith_inst_retired.256b_packed_single  #  0.03 GFLOPs  (66.58%)
        99,934,901      fp_arith_inst_retired.scalar_double                       (66.68%)
                 0      fp_arith_inst_retired.128b_packed_single                  (66.71%)
                 0      fp_arith_inst_retired.scalar_single                       (66.71%)
                 0      fp_arith_inst_retired.256b_packed_double                  (66.71%)
                 0      fp_arith_inst_retired.128b_packed_double                  (66.62%)
     3,352,766,500 ns   duration_time

       3.352766500 seconds time elapsed

       3.347268000 seconds user
       0.000000000 seconds sys
Unfortunately it had to multiplex between those events, since there were more than 4, and hyperthreading is enabled, so the total number of scalar double-precision FP operations (99,934,901) was measured a bit lower than the awk loop iteration count. With just -e task_clock,cycles,instructions,fp_arith_inst_retired.scalar_double, it came out at exactly 100,000,000 counts, since apparently gawk did no other FP operations.
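The 0.03 GFLOPs figure in the perf output can be reproduced by hand from the counts shown. This sketch assumes the usual weighting, where each event count is multiplied by the number of FP elements per instruction; perf's actual metric expression may differ between CPU models and perf versions, and the dictionary and function names here are illustrative only.

```python
# FP elements per instruction for each event: vector width / element width.
# Assumed weighting; FMA instructions already increment these counters twice.
ELEMENTS_PER_EVENT = {
    "fp_arith_inst_retired.scalar_single": 1,
    "fp_arith_inst_retired.scalar_double": 1,
    "fp_arith_inst_retired.128b_packed_double": 2,
    "fp_arith_inst_retired.128b_packed_single": 4,
    "fp_arith_inst_retired.256b_packed_double": 4,
    "fp_arith_inst_retired.256b_packed_single": 8,
}

def gflops_from_counters(counts: dict, duration_ns: int) -> float:
    """Weight each event count by its element width, then divide by time."""
    flops = sum(ELEMENTS_PER_EVENT[event] * n for event, n in counts.items())
    return flops / duration_ns  # FLOP per nanosecond equals GFLOP/s

# Counts and duration copied from the perf output above (zero counts omitted).
counts = {"fp_arith_inst_retired.scalar_double": 99_934_901}
result = gflops_from_counters(counts, 3_352_766_500)
print(f"{result:.2f} GFLOPs")
```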
Of course, awk is not a high-FP-throughput program, and it only used scalar FP math. Numeric variables in awk are double-precision, as in JavaScript, but unlike a JS engine awk doesn't JIT-compile, let alone take advantage of being able to do the arithmetic on integers instead.
Upvotes: 2