Reputation: 965
The cpu in my testbed is "Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz", so it's Ivy Bridge and the third generation intel core process.
I use the testp provided by Agner Fog to do an analysis.
According to the following test result, if I align the asm code, the front stall cycles would be much small, but the number of instructs or uops would increase. and totally the core cycles are similar.
So what should I do to optimize the asm code? or where I am wrong?
Please pay attention to the last column. As I understanding, it means the front stall, so I try to reduce by using ".align 16" to get a better performance, you can see I use three .align, and RESOURCE_STALL.ANY is almost null.
If I don't add .align 16, the test result:
Clock BrTaken Core cyc Instruct Uops Uops F.D. rs any
60763677 49076140 66968169 229533042 164009900 164144552 17506254
59512474 49076140 66856747 229533041 164003819 164133385 17481613
If I align the asm code, the test result:
Clock BrTaken Core cyc Instruct Uops Uops F.D. rs any
73537615 49076140 82188535 311530045 246006910 246064159 3
74063027 49076140 82217717 311530045 246002615 246067520 4
If I remove the 2nd .align and move the 3nd before the label, the result:
Clock BrTaken Core cyc Instruct Uops rs any
59930621 49076358 66898617 245995143 180472213 17440098
59596305 49076140 66886858 245993040 180461441 17487217
The information of the counter:
event mask comment
BrTanken 0xa2 0x01 branches taken
Uops 0xc2 0x01 uops retired, unfused domain.
Uops F.D 0x0E 0x01 uops, fused domain, Sandy Bridge.
rs any 0xa2 0x01 Cycles Allocation are stalled due to Resource Related reason.
Instruct 0x40000000 Instructions (reference counter)
the asm code, I get it from gcc O1.
.section .text
.globl sum_sort_O1
.type sum_sort_O1, @function
sum_sort_O1:
movl $100000, %esi
movl $0, %eax
jmp .L2
#.align 16 --- 1
.L6:
movl (%rdi,%rdx), %ecx
cmpl $127, %ecx
jle .L3
#.align 16 --- 2
movslq %ecx, %rcx
addq %rcx, %rax
.L3:
#.align 16 --- 3
addq $4, %rdx
cmpq $131072, %rdx
jne .L6
subl $1, %esi
je .L5
.L2:
movl $0, %edx
jmp .L6
.L5:
rep ret
The c code:
#define arraySize 32768
long long sort(int* data)
{
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
}
============================================================== Inter Performance-Monitor Event
========================================================================= Optimization
I'm reading the document "64-ia-32-architectures-optimization-manual.pdf" and I find a method "TOP-DOWN ANALYSIS METHOD" in the section Appendix B.1. And I try to optimize this section of codes with the method.
The first problem which I met is the branch misprediction, and we could sort the data to resolve it.
The second problem is this one, the front stall and reource_stall is very big. If I add .align, the stall is much reduced, but the totally time is not reduce. So I think my direction is right but my method is not good, that is why I list the problem here to ask some idea.
I know optimize this section of code is a big work, I try to optimize it step by step with the method "TOP-DOWN ANALYSIS METHOD".
========================================================================== Back Ground
I'm interesting in the question. "Why is it faster to process a sorted array than an unsorted array?"
Yes, the problem which the author met is Branch Prediction.
And I want to optimize the code, when the Branch Prediction has been resolved.
Upvotes: 2
Views: 161
Reputation: 64925
Alignment of loops0 may help performance on some architectures in some scenarios, but it may not. In particular, when you omit an explicit directive it may be that the natural alignment is just fine1.
How likely this is depends on why alignment matters: let's say you suffer a penalty if your loop body crosses a 16-byte boundary within the first 5 bytes of instructions from the top: even if you don't force alignment, this only has a ~27% chance of happening if the entry point is randomly distributed, so in this case alignment usually makes no difference.
Even if alignment removes a penalty in one part of the pipeline (usually from the front-end), it may very well be that that component isn't the overall bottleneck, so avoiding the penalty doesn't improve the final performance.
Newer x86 CPUs have also in some ways reduced the impact of code alignment by introducing things like the uop cache, and having overall better front-end performance (e.g., decoding 5 instructions/cycle on Skylake) which makes it less likely that the front-end is a bottleneck.
More to the point, in your particular example, the last two out of the three .align
locations are internal to the main loop! This means that the instructions executed for alignment will may be executed frequently, meaning that alignment can have a large downside that may more than cancel out any gains. You want alignment only in cases where the alignment padding is only rarely executed, such as before a hot loop - but not inside it! This problem is why your instruction count jumps by a large amount in the "aligned" case.
Anyways, alignment will have little effect on an algorithm like this one with dense non-loop branching, unless all those branches are well predicted: the branch misprediction costs will dominate instead.
0 Generally speaking this refers to aligning the first instruction in a loop, aka the "loop top", but as it turns out other types of alignment may also be useful (these other types are never implemented in compilers, to my knowledge).
1 This is in contrast to say architectures which suffer a penalty on mis-aligned memory access. There it is often the case that only "exactly aligned" accesses run at full speed, but anything else is misaligned and suffers a penalty (and there may be even larger penalties for special misaligned cases, e.g., cache-line or page crossing access). Alignment for code generally doesn't work like that: alignment is only a heuristic that generally helps avoid some possible penalty, but there may be may other "not aligned" configurations that also avoid the penalty.
Upvotes: 4