bagel_lord
bagel_lord

Reputation: 59

Why do clang and gcc duplicate code and branch vs unconditional jump inside loop?

Why do compilers seem to prefer to optimize pretest loops into a check, conditional jump over, then a do-while structure, versus doing an unconditional jump inside the the do while loop?

I wrote a function that is written in the second style I described, but both g++ and clang undo it and transform it into method one. https://godbolt.org/g/2Dvudi

I'm confused because it seems like the compiler duplicates a lot of instructions (maybe not that much for this example) for the pretest. Also, It may do a jump anyway (though maybe it is statically predicted to be not taken, and not a big deal in the average case), so why not just do an unconditional jump always?

Here is one thought I have about it, but It doesn't strongly support either method:
The loop wants to be aligned, so there may be room to duplicate some instructions upfront anyway without wasting space, since they'd be padded with nops. However, both clang and gcc emit code that exceeds 16-bytes for the pretest, and end up inserting a large nop afterwards.

Edit: Here is the code from the godbolt link:

typedef unsigned char uchar;

unsigned my_atoi(const uchar *p)//sentinel at end
{
  unsigned acm=0u;
  unsigned d;
  goto LEnter;
  do{
    acm =  acm*10u + d;
  LEnter:
    d = *p++ - '0';
  }while (d<10u);
  return acm;
}

clang 5.0 at -O2 emits:

my_atoi(unsigned char const*):                          # @my_atoi(unsigned char const*)
    movzx   ecx, byte ptr [rdi]
    add     ecx, -48
    xor     eax, eax
    cmp     ecx, 9
    ja      .LBB0_3
    inc     rdi
    xor     eax, eax
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
    lea     eax, [rax + 4*rax]
    lea     eax, [rcx + 2*rax]
    movzx   ecx, byte ptr [rdi]
    add     ecx, -48
    inc     rdi
    cmp     ecx, 10
    jb      .LBB0_2
.LBB0_3:
    ret

Upvotes: 3

Views: 658

Answers (2)

Surt
Surt

Reputation: 16109

Initializing d and removing the goto makes it stop being strange.

typedef unsigned char uchar;

unsigned my_atoi(const uchar *p) {//sentinel at end
  unsigned acm=0u;
  unsigned d=0; // initialized
//  goto LEnter;
  do{
    acm =  acm*10u + d;
//  LEnter:
    d = *p++ - '0';
  }while (d<10u);
  return acm;
}


 xor    eax,eax // acm=0
 xor    ecx,ecx // d=0
 data16 data16 nop WORD PTR cs:[rax+rax*1+0x0] // nops, aligns to 16 bytes
 lea    eax,[rax+rax*4] // *5
 lea    eax,[rcx+rax*2] // *2+d
 movzx  ecx,BYTE PTR [rdi] // d=*p
 inc    rdi // p++
 add    ecx,0xffffffd0 // d-'0'
 cmp    ecx,0xa // d<10
 jb     4004e0 <my_atoi(unsigned char const*)+0x10>
 ret    
 data16 nop WORD PTR cs:[rax+rax*1+0x0]
main:
 xor    eax,eax
 ret    
 nop    WORD PTR cs:[rax+rax*1+0x0]
 nop    DWORD PTR [rax]

So the reason the compiler double the check is that you would jump into a none-aligned code. The central loop looks small enough to be in the loop-buffer of the CPU, but jumping into the middle of it might pollute the loop-buffer.

Upvotes: 0

Mikhail Maltsev
Mikhail Maltsev

Reputation: 1678

Quoting some comments from GCC sources of a relevant optimization pass.

Duplicates headers of loops if they are small enough, so that the statements in the loop body are always executed when the loop is entered. This increases effectiveness of code motion optimizations, and reduces the need for loop preconditioning.

I.e., if the later passes find some loop-invariant code, they will have a place to move that code to without a need to add a check, whether the loop will iterate at all.

For all loops, copy the condition at the end of the loop body in front of the loop. This is beneficial since it increases efficiency of code motion optimizations. It also saves one jump on entry to the loop.

Upvotes: 1

Related Questions