technosaurus

Reputation: 7812

Why is adding a superfluous mask and bitshift more optimizable?

While writing an integer-to-hex-string function, I noticed that I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold).

char *i2s(int n){
    static char buf[(sizeof(int)<<1)+1]={0};
    int i=0;
    while(i<(sizeof(int)<<1)+1){    /* mask the ith hex, shift it to lsb */
//      buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
        buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
        if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
    }
    for(i=0;buf[i++]=='0';)
        /*find first non-zero*/;
    return (char *)buf+i;
}

With the extra bit shift and mask, compiled with gcc -S -O3, the loops unroll and the function reduces to:

    movb    $48, buf.1247
    xorl    %eax, %eax
    movb    $48, buf.1247+1
    movb    $48, buf.1247+2
    movb    $48, buf.1247+3
    movb    $48, buf.1247+4
    movb    $48, buf.1247+5
    movb    $48, buf.1247+6
    movb    $48, buf.1247+7
    movb    $48, buf.1247+8
    .p2align 4,,7
    .p2align 3
.L26:
    movzbl  buf.1247(%eax), %edx
    addl    $1, %eax
    cmpb    $48, %dl
    je  .L26
    addl    $buf.1247, %eax
    ret

This is what I expected for 32-bit x86 (64-bit should be similar, but with twice as many movb-like ops); however, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.

Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so right-shifting a signed value with >> may pad with 1s instead of 0s when the sign bit is set.)

Upvotes: 4

Views: 285

Answers (2)

Fabio Fernandes

Reputation: 59

I think it has to do with the fact that in the longer version, you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimise based on that fact.

Regarding the right-shifting, C++ can do the equivalent of both of Java's operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" will depend on whether x is signed or unsigned. If x is signed, then shift-right-arithmetic (SRA, sign-extending) is used, and if x is unsigned, then shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers.
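For illustration, here is a minimal sketch (mine, not part of the original answer) showing that a cast to unsigned is how you get the equivalent of Java's >>> in C:

#include <stdio.h>

int main(void)
{
  int sx = -16;                  /* sign bit set */
  unsigned ux = (unsigned)sx;
  printf("%d\n", sx >> 2);       /* arithmetic shift on gcc/x86: prints -4 */
  printf("%u\n", ux >> 2);       /* logical shift, Java's >>>: prints 1073741820 with 32-bit int */
  return 0;
}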

Unrolling loops is no longer a good idea because: 1) newer processors come with a micro-op buffer that will often speed up small loops, and 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.

The algorithm doesn't have to be that complicated. Also, your algorithm has a bug: it returns '0' for multiples of 16, and for 0 itself it returns an empty string.

Below is a rewrite of the algorithm which is branch-free except for the loop exit check (or completely branch-free if the compiler decides to unroll it). It is faster, generates shorter code, and fixes the multiple-of-16 bug.

Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algorithm: they each take only 1 clock cycle, and the CPU is able to execute 3 or 4 of them in the same clock cycle.

const char* i2s_brcfree(int n)
{
  static char buf[sizeof(n)*2 + 1] = {0};
  unsigned int nibble_shifter = n;
  for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
    const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
    char digit = '0' + curr_nibble;
    // "promote" to hex if nibble is over 9, 
    // conditionally adding the difference between ('0'+nibble) and 'A' 
    enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
    digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
    *p = digit;
  }
  return buf;
}
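
A quick usage sketch (mine, assuming 32-bit int); note that this version keeps leading zeros instead of stripping them like the original:

#include <stdio.h>

/* assumes i2s_brcfree() from above is in scope */
int main(void)
{
  printf("%s\n", i2s_brcfree(255)); /* prints 000000FF */
  printf("%s\n", i2s_brcfree(16));  /* prints 00000010: the multiple-of-16 case */
  printf("%s\n", i2s_brcfree(0));   /* prints 00000000, not an empty string */
  return 0;
}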

Edit: C++ does not define the result of right-shifting negative numbers; it is implementation-defined. I only know that GCC and Visual Studio use arithmetic (sign-extending) shifts on the x86 architecture.

Upvotes: 2

Peter Cordes

Reputation: 365832

It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.

gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.

clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.

Obviously this means your function is buggy! And that's not the only bug.

  • off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
  • off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
  • undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
  • Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1] would be portable, but you actually need to round up when dividing by four.) A bugfixed sketch follows this list.
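
Here is one way a bugfixed version could look (my sketch, not the exact godbolt version mentioned below; the name i2s_fixed is made up): it writes exactly one char per nibble plus a real terminator, avoids i++ in expressions that also read i, and derives the buffer size from CHAR_BIT:

#include <limits.h>

char *i2s_fixed(int n)
{
    /* one char per nibble, rounding the bit count up, plus a terminator */
    static char buf[(CHAR_BIT * sizeof(int) + 3) / 4 + 1];
    enum { DIGITS = sizeof(buf) - 1 };
    unsigned un = n;                   /* unsigned, so >> zero-extends */
    for (int i = 0; i < DIGITS; i++) {
        char d = '0' + (0xF & (un >> (4 * (DIGITS - 1 - i))));  /* MS nibble first */
        if (d > '9') d += 'A' - '0' - 10;
        buf[i] = d;
    }
    buf[DIGITS] = 0;                   /* a real terminating 0, not '0' */
    int i = 0;
    while (i < DIGITS - 1 && buf[i] == '0')
        ++i;                           /* skip leading zeros, keep one digit for n==0 */
    return buf + i;
}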

clang and gcc both warn about the precedence of - vs. << (in C, - binds tighter, so (sizeof(int)<<3)-i<<2 parses as ((sizeof(int)<<3)-i)<<2), but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i] = '0' + (0x0f & (n >> (4*i))); That compiles to pretty clunky code, though. gcc probably does better with @Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-count shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
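
A minimal sketch of that repeated-shift form (my illustration; fill_hex and the 32-bit int assumption are mine): because the shift count is the constant 4, the compiler can use shr reg, 4 even if it leaves the loop rolled up:

void fill_hex(char buf[9], int n)
{
    unsigned tmp = (unsigned)n;        /* unsigned, so >> zero-extends */
    for (int i = 7; i >= 0; --i, tmp >>= 4) {
        char d = '0' + (tmp & 0xF);    /* current low nibble */
        if (d > '9') d += 'A' - '0' - 10;
        buf[i] = d;                    /* most-significant digit first */
    }
    buf[8] = '\0';
}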


In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.

It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the nine single-byte stores into wider stores the way clang does.)

I put a bugfixed version on godbolt, along with your original.

Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)


gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!

i2s_orig:
    mov     BYTE PTR buf.1406+3, 48
    mov     BYTE PTR buf.1406, 48
    cmp     BYTE PTR buf.1406+3, 48
    mov     BYTE PTR buf.1406+1, 48
    mov     BYTE PTR buf.1406+2, 48
    mov     BYTE PTR buf.1406+4, 48
    mov     BYTE PTR buf.1406+5, 48
    mov     BYTE PTR buf.1406+6, 48
    mov     BYTE PTR buf.1406+7, 48
    mov     BYTE PTR buf.1406+8, 48
    mov     BYTE PTR buf.1406+9, 0
    jne     .L7    # testing flags from the compare earlier
    jne     .L8
    jne     .L9
    jne     .L10
    jne     .L11
    sete    al
    movzx   eax, al
    add     eax, 8
.L3:
    add     eax, OFFSET FLAT:buf.1406
    ret
.L7:
    mov     eax, 3
    jmp     .L3
 ... more of the same, setting eax to 4, or 5, etc.

Putting multiple jne instructions in a row is obviously useless: once the first one falls through, the flags are unchanged, so the later jne instructions can never be taken either.

Upvotes: 2
