Forward

Reputation: 965

Why is the cycle count of this loop similar in the aligned and unaligned cases?

The CPU in my testbed is "Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz", so it's Ivy Bridge, a third-generation Intel Core processor.

I use the testp tool provided by Agner Fog to do the analysis.

According to the following test results, if I align the asm code, the front-end stall cycles become much smaller, but the number of instructions and uops increases, and overall the core cycles are similar.

So what should I do to optimize the asm code, or where am I wrong?

Please pay attention to the last column. As I understand it, it means the front stall, so I tried to reduce it by using ".align 16" to get better performance. You can see I use three .align directives, and RESOURCE_STALLS.ANY drops to almost zero.

If I don't add .align 16, the test result:

  Clock    BrTaken   Core cyc   Instruct    Uops        Uops F.D.    rs any
60763677   49076140   66968169  229533042  164009900   164144552   17506254 
59512474   49076140   66856747  229533041  164003819   164133385   17481613

If I align the asm code, the test result:

  Clock    BrTaken   Core cyc   Instruct     Uops       Uops F.D.     rs any 
73537615   49076140   82188535  311530045  246006910   246064159        3 
74063027   49076140   82217717  311530045  246002615   246067520        4 

If I remove the 2nd .align and move the 3rd before the label, the result:

     Clock    BrTaken   Core cyc   Instruct       Uops     rs any 
  59930621   49076358   66898617  245995143  180472213   17440098 
  59596305   49076140   66886858  245993040  180461441   17487217

The information about the counters:

              event mask   comment
   BrTaken    0xa2  0x01   branches taken
   Uops       0xc2  0x01   uops retired, unfused domain
   Uops F.D.  0x0E  0x01   uops, fused domain, Sandy Bridge
   rs any     0xa2  0x01   cycles in which allocation is stalled for a resource-related reason
   Instruct   0x40000000   instructions (reference counter)
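
As a side note, the same "rs any" event could in principle be read on Linux through perf_event_open instead of Agner Fog's driver. The following is only a rough sketch, not part of the original measurement setup; it programs raw event 0xA2 with umask 0x01 (RESOURCE_STALLS.ANY) and leaves a placeholder where the code under test would run:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc does not export perf_event_open directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x01a2;          /* umask 0x01, event 0xA2: RESOURCE_STALLS.ANY */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the code under test here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("RESOURCE_STALLS.ANY: %lld\n", count);
    close(fd);
    return 0;
}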

The asm code (I got it from gcc -O1):

.section .text
.globl sum_sort_O1
.type sum_sort_O1, @function
sum_sort_O1:
        movl    $100000, %esi          # outer loop counter (100000 passes)
        movl    $0, %eax               # sum = 0 (returned in %rax)
        jmp     .L2
#.align 16 --- 1
.L6:                                   # inner loop over the array
        movl    (%rdi,%rdx), %ecx      # ecx = data[c]
        cmpl    $127, %ecx
        jle     .L3                    # skip the add if data[c] <= 127
#.align 16 --- 2
        movslq  %ecx, %rcx
        addq    %rcx, %rax             # sum += data[c]
.L3:
#.align 16 --- 3
        addq    $4, %rdx               # next element (4-byte ints)
        cmpq    $131072, %rdx          # 32768 * 4 bytes
        jne     .L6
        subl    $1, %esi               # one outer pass done
        je      .L5
.L2:
        movl    $0, %edx               # byte offset, reset for each pass
        jmp     .L6
.L5:
        rep ret                        # return sum in %rax
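
For context, the routine takes the array pointer in %rdi and returns the 64-bit sum in %rax. The harness that calls it isn't shown in the post, so the following is only an illustrative C driver; the data values are made up for the example:

#include <stdio.h>
#include <stdlib.h>

/* Prototype matching the asm routine: pointer in %rdi, 64-bit sum in %rax. */
extern long long sum_sort_O1(int *data);

int main(void)
{
    enum { arraySize = 32768 };
    int *data = malloc(arraySize * sizeof *data);
    if (!data)
        return 1;

    /* Fill with ascending values so the data-dependent branch is well predicted. */
    for (int c = 0; c < arraySize; ++c)
        data[c] = (c * 256) / arraySize;   /* 0..255, sorted */

    long long sum = sum_sort_O1(data);
    printf("sum = %lld\n", sum);

    free(data);
    return 0;
}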

The C code:

#define arraySize 32768

long long sort(int* data)
{
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
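
As a side note, one way to avoid the branch-prediction problem without sorting is to make the inner loop branchless. The sketch below is only an illustration of that idea (the function name is mine), not the code that was measured above:

#define arraySize 32768

/* Branchless variant: build an all-ones or all-zeros mask from the
 * comparison and use it to conditionally add, so there is no
 * data-dependent branch for the predictor to get wrong. */
long long sum_branchless(int *data)
{
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            int v = data[c];
            int mask = -(v >= 128);    /* all ones if v >= 128, else 0 */
            sum += v & mask;
        }
    }
    return sum;
}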

============================================================== Intel Performance-Monitor Event

(screenshot of the Intel performance-monitoring event reference omitted)

========================================================================= Optimization

I'm reading the document "64-ia-32-architectures-optimization-manual.pdf" and found the "TOP-DOWN ANALYSIS METHOD" in Appendix B.1. I'm trying to optimize this section of code with that method.

The first problem I met was branch misprediction, which can be resolved by sorting the data.

The second problem is this one: the front-end stall and resource_stall counts are very big. If I add .align, the stalls are much reduced, but the total time is not. So I think my direction is right but my method is not good, which is why I'm posting the problem here to ask for ideas.

I know optimizing this section of code is a big job; I'm trying to optimize it step by step with the Top-Down Analysis Method.
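
For reference, as I read Appendix B.1, the level-1 breakdown boils down to four ratios computed from a handful of counters. The sketch below is only my paraphrase of the manual's formulas, with variable names of my own choosing:

/* Level-1 top-down breakdown, as described in Appendix B.1 of the Intel
 * optimization manual. Inputs are raw counter values collected over the
 * region of interest:
 *   clk       = CPU_CLK_UNHALTED.THREAD
 *   issued    = UOPS_ISSUED.ANY            (fused domain, at allocation)
 *   retired   = UOPS_RETIRED.RETIRE_SLOTS  (fused domain, at retirement)
 *   fe_missed = IDQ_UOPS_NOT_DELIVERED.CORE
 *   recovery  = INT_MISC.RECOVERY_CYCLES
 */
struct topdown_l1 {
    double frontend_bound, bad_speculation, retiring, backend_bound;
};

static struct topdown_l1
topdown_level1(double clk, double issued, double retired,
               double fe_missed, double recovery)
{
    struct topdown_l1 r;
    double slots = 4.0 * clk;   /* 4 allocation slots per cycle on SnB/IvB */

    r.frontend_bound  = fe_missed / slots;
    r.bad_speculation = (issued - retired + 4.0 * recovery) / slots;
    r.retiring        = retired / slots;
    r.backend_bound   = 1.0 - (r.frontend_bound + r.bad_speculation + r.retiring);
    return r;
}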

========================================================================== Background

I'm interested in the question "Why is it faster to process a sorted array than an unsorted array?"

Yes, the problem the author met there is branch prediction.

And I want to optimize the code further, now that the branch prediction issue has been resolved.

Upvotes: 2

Views: 161

Answers (1)

BeeOnRope

Reputation: 64925

Alignment of loops[0] may help performance on some architectures in some scenarios, but it may not. In particular, when you omit an explicit directive it may be that the natural alignment is just fine[1].

How likely this is depends on why alignment matters. Say you suffer a penalty if your loop body crosses a 16-byte boundary within the first 5 bytes of instructions from the top: even if you don't force alignment, this has only a ~27% chance of happening if the entry point is randomly distributed, so in that case alignment usually makes no difference.

Even if alignment removes a penalty in one part of the pipeline (usually from the front-end), it may very well be that that component isn't the overall bottleneck, so avoiding the penalty doesn't improve the final performance.

Newer x86 CPUs have also in some ways reduced the impact of code alignment by introducing things like the uop cache, and having overall better front-end performance (e.g., decoding 5 instructions/cycle on Skylake) which makes it less likely that the front-end is a bottleneck.

More to the point, in your particular example, the last two of the three .align locations are internal to the main loop! This means that the padding instructions inserted for alignment may be executed frequently, so alignment can have a large downside that may more than cancel out any gains. You want alignment only in cases where the alignment padding is rarely executed, such as before a hot loop - but not inside it! This is why your instruction count jumps by a large amount in the "aligned" case.

Anyway, alignment will have little effect on an algorithm like this one with dense non-loop branching unless all those branches are well predicted: otherwise the branch-misprediction costs will dominate instead.


[0] Generally speaking this refers to aligning the first instruction in a loop, a.k.a. the "loop top", but as it turns out other types of alignment may also be useful (these other types are never implemented in compilers, to my knowledge).

[1] This is in contrast to, say, architectures which suffer a penalty on misaligned memory access. There it is often the case that only "exactly aligned" accesses run at full speed, while anything else is misaligned and suffers a penalty (and there may be even larger penalties for special misaligned cases, e.g., cache-line- or page-crossing accesses). Alignment for code generally doesn't work like that: alignment is only a heuristic that generally helps avoid some possible penalty, but there may be many other "not aligned" configurations that also avoid the penalty.

Upvotes: 4
