Megan Darcy

Reputation: 582

How will unrolling affect the cycles per element (CPE) count?

  1. how do I calculate CPE (cycles per element) with these code snippets?
  2. what is the difference in the CPE between the 2 given code snippets?

I have this piece of code

void randomFunction(float a[], float Tb[], float c[], long int n){
    int i,j,k;
    for(i=0;i<n;++i)
        for(j=0;j<n;++j)
            for(k=0;k<n-1;k++){
                temp += a[i*n+k] * Tb[j*n+k];
            }
}

This is the assembly for the inner-most loop, from GCC 10.3 -O2 (https://godbolt.org/z/cWE16rW1r), for a version of the function with a local float temp=0; that it returns, so the loop won't be optimized away:

.L4:
    movss   (%rcx,%rax), %xmm0
    mulss   (%rdx,%rax), %xmm0
    addq    $4, %rax
    addss   %xmm0, %xmm1
    cmpq    %rax, %rsi
    jne     .L4

Now I am trying to 'optimise it' using unrolling.

void randomUnrollingFunction(float a[], float Tb[], float c[], long int n){
    int i,j,k;
    for(i=0;i<n;++i)
        for(j=0;j<n;++j)
            for(k=0;k<n-1;k+=2){ // this is the unrolled portion by 2
                temp += a[i*n+k]   * Tb[j*n+k];
                temp += a[i*n+k+1] * Tb[j*n+k+1];
            }
}

I am wondering what estimated CPE will be achieved with this unrolling by a factor of 2.
CPE is the number of cycles / the number of elements processed.

This is the latency information I was given: [image: functional-unit latency table]

Thank you for any help in advance!

Upvotes: 1

Views: 387

Answers (1)

Peter Cordes

Reputation: 363999

Your loop entirely bottlenecks on addss latency (float add), at 3 cycles per element, assuming your Haswell-like CPU has the memory bandwidth to keep up.1 Out-of-order exec lets the other work run in the "shadow" of this chain.

This way of unrolling doesn't help at all: it doesn't change the serial dependency chain, unless you compiled with -ffast-math or something to let the compiler turn it into

temp1 += a[...+0]* Tb[...+0];
temp2 += a[...+1]* Tb[...+1];
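Written out by hand, that transformation might look something like this (a hypothetical sketch; note it changes the order of the FP additions, which is exactly why a compiler can't do it on its own without -ffast-math):

float temp1 = 0, temp2 = 0;
for(k=0;k<n-1;k+=2){
    temp1 += a[i*n+k]   * Tb[j*n+k];     // independent chain 1
    temp2 += a[i*n+k+1] * Tb[j*n+k+1];   // independent chain 2
}
temp += temp1 + temp2;                   // combine outside the loop

With two independent accumulators, the 3-cycle addss latency is paid once per 2 elements instead of once per element.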

Or add pairs before feeding them into that serial dependency, like

temp +=  a[]*Tb[] + a[+1]*Tb[+1];

One long serial dependency chain is the worst, and also numerically not great: pairwise summation (or especially just a step in that direction using multiple accumulators) would be numerically better and also perform better. (Simd matmul program gives different numerical results).

(If your multiple accumulators are 4 elements of one SIMD vector, you can do 4x the amount of work with the same pattern of asm instructions. But you then need to unroll with multiple vectors, because addps has the same performance characteristics as addss on modern x86 CPUs.)

Footnote 1: Two sequential read streams of 4 bytes each per 3 cycles; certainly a desktop Haswell could keep up, and probably even a Xeon competing with many other cores for memory bandwidth. But the reads from a[] probably hit in cache, because a[i*n+k] is the same row repeatedly until we move on to the next outer-loop iteration. So only 1 row of a[] has to stay hot in cache (to get hits next middle iteration) while we scan through a row of Tb. So a[] has to come in from DRAM once total, if n isn't huge, but we loop over the whole Tb[] in order n times.


More detailed version

See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - look for the dependency chains (in this case the addss into %xmm1). Also A whirlwind introduction to dataflow graphs.

Then look for throughput bottlenecks in the back-end and front-end. In your case, latency dominates. (Assuming the front-end matches this Haswell back-end, although it wouldn't take much to keep up with this back-end latency bottleneck. Also, I hate that they number their "functional units" from 1, instead of following Intel's numbering of ports 0,1,5,6 having ALUs. Haswell's FP adder is on port 1, and ports 2/3 are load/store-AGU units, etc.)

ADDSS has 3 cycle latency, so temp += ... can only execute once per 3 cycles. The load / MULSS operations just independently prepare inputs for ADDSS, and the loop overhead is also independent.
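Putting rough numbers on that (counting only the addss latency chain, and ignoring front-end and memory effects):

  - original loop: one addss in the chain per element, so CPE ≈ 3 cycles / 1 element = 3
  - your unroll by 2: still one serial addss per element, so CPE ≈ 3
  - a pair-summing unroll (temp += a[]*Tb[] + a[+1]*Tb[+1]): one serial addss per 2 elements, so CPE ≈ 3 / 2 = 1.5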


Note that if not for the latency bottleneck, your loop would bottleneck on the front-end on a real Haswell (4 fused-domain uops per cycle), not on back-end functional units. The loop is 5 fused-domain uops, assuming macro-fusion of the cmp/jne, and that Haswell can keep the memory-source mulss micro-fused despite the indexed addressing mode. (Sandybridge would un-laminate it.)
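For reference, a rough per-iteration uop accounting under those assumptions:

  - movss load: 1 uop
  - mulss with memory source: 1 micro-fused uop
  - addq pointer increment: 1 uop
  - addss accumulate: 1 uop
  - cmpq/jne: 1 macro-fused uop

That's 5 fused-domain uops, so the front-end could issue one iteration per 5 / 4 = 1.25 cycles if latency weren't the limit.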

In the general case, knowing the back-end functional units is not sufficient. The front-end is also a common bottleneck, especially in loops that have to store something, like an actual matmul.

But thanks to the serial dependency on ADDSS (which actually carries across outer loops), the only thing that matters is that dependency chain.

Even a branch mispredict on the last iteration of the inner loop (when the branch is not-taken instead of normally taken) will just give the back-end more time to chew through those pending ADDSS operations while the front-end sorts itself out and starts in on the next inner loop.

Since you unrolled in a way that doesn't change the serial dependency, it makes zero difference to performance except for tiny n. (For tiny n, the whole thing can potentially overlap some of this work with independent work in the caller before/after the call. In that case, saving instructions could be helpful, also allowing out-of-order exec to "see farther". Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths is a case where OoO exec can (partially) overlap two independent imul dep chains that in program-order are one after the other.)

Of course at that point, you're considering code outside what you're showing. And even for n=10, that's 10^3 = 1000 inner iterations, and Haswell's ROB is only 192 uops large, with an RS of 60 entries. (https://www.realworldtech.com/haswell-cpu/3/).


Unrolling in a useful way

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) re: unrolling in ways that do create more instruction-level parallelism, hiding FP dependency chains.

Unrolling differently, only summing once into temp per loop iteration, will keep the same cycles per iteration while still doubling the number of elements you process.

for(k=0;k<n-1;k+=2){ // this is the unrolled portion by 2
    temp += (a[i*n+k]   * Tb[j*n+k] +
             a[i*n+k+1] * Tb[j*n+k+1]);
}

Obviously you can keep doing this until you run into front-end or back-end throughput limits, like one add per clock. The above version does two adds per 3 clocks.
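Taking it a step further, a hypothetical sketch combining the pair trick with two accumulators (it needs a cleanup loop for leftover elements when the trip count isn't a multiple of 4):

float temp1 = 0, temp2 = 0;
for(k=0; k+3<n-1; k+=4){
    temp1 += a[i*n+k]   * Tb[j*n+k]   + a[i*n+k+1] * Tb[j*n+k+1];
    temp2 += a[i*n+k+2] * Tb[j*n+k+2] + a[i*n+k+3] * Tb[j*n+k+3];
}
// ... cleanup loop for the remaining elements, then:
temp += temp1 + temp2;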

Your "functional unit" table doesn't list FMA (fused multiply-add), but real Haswell has it, with identical performance to FP mul. It wouldn't help much, if any, because your current loop structure does 2 loads per mul+add, so reducing that to 2 loads and one FMA would still bottleneck on loads. It might help with front-end throughput, though.

What might help reduce loads is unrolling over one of the outer loops, using both a[i*n+k] and a[(i+1)*n+k] with one Tb[j*n+k]. That of course changes the order of the calculation, so isn't legal for a compiler without -ffast-math because FP math isn't strictly associative.
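A rough sketch of that idea (hypothetical, and again only something you or -ffast-math can do, since it reassociates the FP sums). Each Tb element is loaded once but feeds two multiplies, so it's 3 loads per 2 mul+adds instead of 4:

for(i=0;i<n;i+=2)                      // assumes n is even; otherwise handle the last row of a[] separately
    for(j=0;j<n;++j)
        for(k=0;k<n-1;k++){
            float tb = Tb[j*n+k];      // loaded once, used for two rows of a[]
            temp += a[i*n+k]     * tb;
            temp += a[(i+1)*n+k] * tb;
        }

This combines naturally with the multiple-accumulator / pair-summing tricks above, so the addss dependency chain doesn't become the bottleneck again.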


This is a reduction of a matmul, allowing much better optimization

(Err wait, your code didn't show where temp was re-initialized, or what the c[] arg was for. I just assumed it was global or something, but probably you actually butchered a normal matmul function that stores a separate temp to c[] after every inner loop, by taking that out. In that case, OoO exec between separate middle-loop iterations is relevant for medium-sized n. But you don't show the scheduler / ROB sizes, and that's not something you can easily model; you need to actually benchmark. So this section is probably only applicable to the question I invented, not what you meant to ask!)

Your loops appear to be summing the elements of a matmul result, but they're still structured like a matmul: i.e. do a row x column dot product, but instead of storing that into an N x N result[...] matrix, you just sum the results.

That amounts to summing the product of every element of a[] with every element of the matching column of Tb[] (equivalently, the matching row of b[]). Since we don't need to keep the row x column dot products separate anymore, that enables a lot of optimizations! (There's nothing special or ideal about this order of summation; other orders will have different but likely not worse rounding error.)

For example, you don't need to transpose b into Tb, just use b (unless it was naturally already transposed, in which case that's fine). Your matrices are square so it doesn't matter at all.

Furthermore, you can simply load one or a couple elements from a[] and loop over Tb[], doing those products with FMA operations, one load per FMA, into 10 or 12 vector accumulators. (Or of course cache-block this to loop over a contiguous part of Tb that can stay hot in L1d cache.)

That could approach Haswell's max FLOPS throughput of 2x 256-bit FMA per clock = 8 (float elements per YMM vector) x 2 FMAs / clock x 2 FLOP/FMA = 32 FLOP / clock.
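A rough sketch of that access pattern with AVX2 + FMA intrinsics (hypothetical: the function name, the use of the untransposed b[] as discussed above, and the choice of only 4 accumulators are mine; it assumes n is a multiple of 32, ignores the original's n-1 inner bound, and real code would want more accumulators, a cleanup loop, and cache blocking):

#include <immintrin.h>

// build with e.g. -O2 -march=haswell (needs AVX2 + FMA)
float sum_all_products(const float *a, const float *b, long n)
{
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();

    for (long i = 0; i < n; ++i) {
        for (long k = 0; k < n; ++k) {
            __m256 av = _mm256_set1_ps(a[i*n + k]);   // broadcast one element of a[]
            const float *brow = &b[k*n];              // contiguous row of b == column of Tb
            for (long j = 0; j < n; j += 32) {        // 4 FMAs, one load each
                acc0 = _mm256_fmadd_ps(av, _mm256_loadu_ps(brow + j),      acc0);
                acc1 = _mm256_fmadd_ps(av, _mm256_loadu_ps(brow + j + 8),  acc1);
                acc2 = _mm256_fmadd_ps(av, _mm256_loadu_ps(brow + j + 16), acc2);
                acc3 = _mm256_fmadd_ps(av, _mm256_loadu_ps(brow + j + 24), acc3);
            }
        }
    }
    // horizontal sum of the 4 vector accumulators down to one float
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(s), _mm256_extractf128_ps(s, 1));
    __m128 shuf = _mm_movehdup_ps(v);        // duplicate odd elements
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);        // high pair down to low
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}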

Upvotes: 2
