Ethan

Reputation: 137

Toggle a Specific Bit

I have seen questions like toggle a bit at ith position and How do you set, clear, and toggle a single bit?, but I was wondering if there is a good way to toggle a bit at the ith position in x86-64 assembly?

I tried writing it in C and looking at the generated assembly, but I don't quite understand why some of the instructions are there.

C:

#include <stdio.h>

unsigned long toggle(unsigned long num, unsigned long bit)
{
  num ^= 1 << bit;
  return num;
}

int main()
{
  printf("%lu\n", toggle(100, 60));
  return 0;
}

Toggle function assembly from GDB:

<toggle>
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-0x8],rdi
mov QWORD PTR [rbp-0x10],rsi
mov rax, QWORD PTR [rbp-0x10]
mov edx, 0x1
mov ecx, eax
shl edx, cl
mov eax, edx
cdqe
xor QWORD PTR [rbp-0x8],rax
mov rax, QWORD PTR [rbp-0x8]
pop rbp
ret

Can someone walk me through what's going on at the assembly level, so I can better understand it and write my own toggle function in x86-64?

Upvotes: 3

Views: 5082

Answers (1)

Peter Cordes

Reputation: 364997

I was wondering if there was a good way to toggle a bit at the ith position in x86-64 assembly?

Yes, x86's BTC (Bit Test and Complement) instruction does exactly that (as well as setting CF to the old value of the bit), and runs efficiently on all modern CPUs.

  • Intel SnB-family: 1 uop, 1c latency, 2 per clock throughput (Nehalem and earlier: 1 per clock).
  • Silvermont/KNL: 1 uop, 1c latency, 1 per clock throughput.
  • AMD Ryzen: 2 uops, 2c latency, 2 per clock throughput.
  • AMD Bulldozer-family / Jaguar: 2 uops, 2c latency, 1 per clock throughput.
  • AMD K8/K10: 2 uops, 2c latency, 1 per clock throughput.

Source: Agner Fog's instruction tables and x86 optimization guide. See also other performance links in the tag wiki.

toggle:
    mov  rax, rdi        ; return value = num
    btc  rax, rsi        ; complement bit (rsi & 63); old bit value -> CF
    ret

(If you'd written toggle correctly in C).

Don't use btc with a memory operand: the bit-string instructions have insane CISC semantics where the bit index isn't limited to within the dword/qword selected by the addressing mode. (So btc m,r is 10 uops with one per 5c throughput on Skylake.) But with a register destination, the bit index is masked to the operand size, exactly like variable-count shifts mask their count.

Unfortunately gcc and clang miss this peephole optimization, even with -march=haswell or -mtune=intel. It's worth using even on AMD, but it's even more efficient on Intel.
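If you want btc from C today without writing the whole function in asm, you can force it with GNU extended inline asm. A minimal sketch, assuming GCC/clang with the default AT&T syntax; the helper name and constraints are mine, not anything the compilers emit on their own:

#include <stdint.h>

// Sketch (my helper, not a compiler intrinsic): force btc via GNU
// extended inline asm, since gcc/clang miss the peephole from pure C.
static inline uint64_t toggle_btc(uint64_t num, uint64_t bit)
{
    __asm__("btc %1, %0"
            : "+r"(num)   // read-write destination register
            : "r"(bit)    // bit index; btc masks it with & 63
            : "cc");      // btc writes CF (the old bit value)
    return num;
}

The "cc" clobber matters because btc sets CF to the old value of the bit.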


Repeated use of the same 1ULL << bit with multiple inputs

On AMD CPUs where btc is slower than xor, it's worth generating the mask in a register and using xor. Or even on Intel CPUs, this is worth it to toggle a bit in memory. (memory-destination xor is much better than memory-destination btc).
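As a scalar sketch of that idea (my example, not from the original answer): hoist the mask out, so each toggle is a single xor, including memory-destination xor for values kept in memory:

#include <stdint.h>

// Sketch: generate 1ULL<<bit once, then reuse it; each ^= can compile
// to a single xor (memory-destination for the pointed-to values).
void toggle_same_bit(uint64_t *a, uint64_t *b, uint64_t *c, unsigned bit)
{
    uint64_t mask = 1ULL << bit;   // one shift, done once
    *a ^= mask;
    *b ^= mask;
    *c ^= mask;
}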

For multiple elements in an array, use SSE2 pxor. You can generate the mask with:

pcmpeqd  xmm0, xmm0        ; all-ones: -1 in every element
psrlq    xmm0, 63          ; 1 in each qword element

movd     xmm1, esi         ; bit index, as the count for psllq
psllq    xmm0, xmm1        ; 1<<bit in each qword element


; then inside a loop, with data in xmm1 (the shift count is no longer needed)
pxor     xmm1, xmm0        ; flip the bit in each qword element
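The same thing written with SSE2 intrinsics, as a hedged sketch (the loop structure and helper name are mine, mirroring the asm above):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>

// Sketch: flip bit `bit` in every 64-bit element of buf, two at a time.
void toggle_bit_array(uint64_t *buf, size_t n, unsigned bit)
{
    __m128i ones = _mm_cmpeq_epi32(_mm_setzero_si128(), _mm_setzero_si128());
    __m128i one  = _mm_srli_epi64(ones, 63);                       // 1 in each qword
    __m128i mask = _mm_sll_epi64(one, _mm_cvtsi32_si128((int)bit)); // 1<<bit

    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, mask));
    }
    if (i < n)
        buf[i] ^= 1ULL << (bit & 63);   // scalar tail for an odd element count
}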

don't quite understand why some of the instructions are there.

All that crap is there because you compiled without optimization, and because you used a signed int constant.


It's not even worth looking at all the spill/reload to memory from the -O0 code. Compile with -O3 -march=native if you want code that doesn't suck.

See also How to remove "noise" from GCC/clang assembly output?, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for a good intro to looking at compiler-generated asm.


Using the signed int constant 1 << bit explains why gcc did a 32-bit shift and then cdqe. num ^= 1 << bit; is equivalent to

int mask = 1;
mask <<= bit;   // still signed int
num ^= mask;    // mask is sign-extended to 64-bit here.

In gcc -O3 output, we get

    mov     ecx, esi          # bit index; the variable shift count must be in cl
    mov     edx, 1
    sal     edx, cl           # 1<<bit   (32-bit)
    movsx   rax, edx          # sign-extend, like cdqe does for eax->rax
    xor     rax, rdi
    ret
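To see the consequence concretely: with bit = 60, as in the question's main, the 32-bit shift's count gets masked to 60 & 31 = 28 by the hardware, so the buggy version typically toggles bit 28 instead of bit 60. It's undefined behaviour in C, though, so this demo (mine) only shows what usually happens on x86:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    volatile uint64_t bit = 60;           // volatile: keep the shift at runtime
    uint64_t num = 100;
    uint64_t buggy = num ^ (1 << bit);    // UB: 32-bit int shifted by 60
    uint64_t fixed = num ^ (1ULL << bit); // well-defined 64-bit shift
    printf("buggy: %#llx\n", (unsigned long long)buggy); // often 0x10000064 (bit 28)
    printf("fixed: %#llx\n", (unsigned long long)fixed); // 0x1000000000000064
    return 0;
}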

If we write toggle correctly:

uint64_t toggle64(uint64_t num, uint32_t bit) {
  num ^= 1ULL << bit;
  return num;
}

(source+asm on the Godbolt compiler explorer)

gcc and clang still miss using btc, but it's not horrible. Interestingly, MSVC does spot the btc peephole, but wastes a MOV instruction:

toggle64 PROC
    mov      eax, edx         ; zero-extend bit: the wasted instruction
    btc      rcx, rax         ; complement bit (rax & 63) in num
    mov      rax, rcx         ; return value (Windows ABI: num arrives in rcx)
    ret      0

Using uint64_t bit avoids the extra MOV: the zero-extending mov eax, edx is unnecessary, because btc with a register destination masks the index with & 63, so high garbage in the register is not a problem. MSVC doesn't know this.
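For reference, that variant looks like this (the function name is mine):

#include <stdint.h>

// The signature change described above: with the index already 64-bit,
// MSVC can feed the incoming register straight to btc, with no
// zero-extending mov.
uint64_t toggle64_idx64(uint64_t num, uint64_t bit)
{
    return num ^ (1ULL << bit);
}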

gcc and clang emit code like you'd expect, except that gcc wastes a MOV instruction by generating 1ULL << bit in rdx and having to copy it to rax.

 ; clang output.
    mov     eax, 1
    mov     ecx, esi
    shl     rax, cl
    xor     rax, rdi
    ret

Upvotes: 4
