Reputation: 59
Why do compilers seem to prefer to optimize pretest loops into a check, conditional jump over, then a do-while structure, versus doing an unconditional jump inside the the do while loop?
I wrote a function that is written in the second style I described, but both g++ and clang undo it and transform it into method one. https://godbolt.org/g/2Dvudi
I'm confused because it seems like the compiler duplicates a lot of instructions (maybe not that much for this example) for the pretest. Also, It may do a jump anyway (though maybe it is statically predicted to be not taken, and not a big deal in the average case), so why not just do an unconditional jump always?
Here is one thought I have about it, but It doesn't strongly support either method:
The loop wants to be aligned, so there may be room to duplicate some instructions upfront anyway without wasting space, since they'd be padded with nops. However, both clang and gcc emit code that exceeds 16-bytes for the pretest, and end up inserting a large nop afterwards.
Edit: Here is the code from the godbolt link:
typedef unsigned char uchar;
unsigned my_atoi(const uchar *p)//sentinel at end
{
unsigned acm=0u;
unsigned d;
goto LEnter;
do{
acm = acm*10u + d;
LEnter:
d = *p++ - '0';
}while (d<10u);
return acm;
}
clang 5.0 at -O2 emits:
my_atoi(unsigned char const*): # @my_atoi(unsigned char const*)
movzx ecx, byte ptr [rdi]
add ecx, -48
xor eax, eax
cmp ecx, 9
ja .LBB0_3
inc rdi
xor eax, eax
.LBB0_2: # =>This Inner Loop Header: Depth=1
lea eax, [rax + 4*rax]
lea eax, [rcx + 2*rax]
movzx ecx, byte ptr [rdi]
add ecx, -48
inc rdi
cmp ecx, 10
jb .LBB0_2
.LBB0_3:
ret
Upvotes: 3
Views: 658
Reputation: 16109
Initializing d
and removing the goto
makes it stop being strange.
typedef unsigned char uchar;
unsigned my_atoi(const uchar *p) {//sentinel at end
unsigned acm=0u;
unsigned d=0; // initialized
// goto LEnter;
do{
acm = acm*10u + d;
// LEnter:
d = *p++ - '0';
}while (d<10u);
return acm;
}
xor eax,eax // acm=0
xor ecx,ecx // d=0
data16 data16 nop WORD PTR cs:[rax+rax*1+0x0] // nops, aligns to 16 bytes
lea eax,[rax+rax*4] // *5
lea eax,[rcx+rax*2] // *2+d
movzx ecx,BYTE PTR [rdi] // d=*p
inc rdi // p++
add ecx,0xffffffd0 // d-'0'
cmp ecx,0xa // d<10
jb 4004e0 <my_atoi(unsigned char const*)+0x10>
ret
data16 nop WORD PTR cs:[rax+rax*1+0x0]
main:
xor eax,eax
ret
nop WORD PTR cs:[rax+rax*1+0x0]
nop DWORD PTR [rax]
So the reason the compiler double the check is that you would jump into a none-aligned code. The central loop looks small enough to be in the loop-buffer of the CPU, but jumping into the middle of it might pollute the loop-buffer.
Upvotes: 0
Reputation: 1678
Quoting some comments from GCC sources of a relevant optimization pass.
Duplicates headers of loops if they are small enough, so that the statements in the loop body are always executed when the loop is entered. This increases effectiveness of code motion optimizations, and reduces the need for loop preconditioning.
I.e., if the later passes find some loop-invariant code, they will have a place to move that code to without a need to add a check, whether the loop will iterate at all.
For all loops, copy the condition at the end of the loop body in front of the loop. This is beneficial since it increases efficiency of code motion optimizations. It also saves one jump on entry to the loop.
Upvotes: 1