Ethan

Reputation: 137

Toggle a Specific Bit

I have seen questions like toggle a bit at ith position and How do you set, clear, and toggle a single bit?, but I was wondering if there is a good way to toggle a bit at the ith position in x86-64 assembly?

I tried writing it in C and looking at the generated assembly, but I don't quite understand why some of the instructions are there.

C:

#include <stdio.h>

unsigned long toggle(unsigned long num, unsigned long bit)
{
  num ^= 1 << bit;
  return num;
}

int main()
{
  printf("%lu\n", toggle(100, 60));
  return 0;
}

Toggle function assembly from GDB:

<toggle>
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-0x8],rdi
mov QWORD PTR [rbp-0x10],rsi
mov rax, QWORD PTR [rbp-0x10]
mov edx, 0x1
mov ecx, eax
shl edx, cl
mov eax, edx
cdqe
xor QWORD PTR [rbp-0x8],rax
mov rax, QWORD PTR [rbp-0x8]
pop rbp
ret

Can someone walk me through what's going on at the assembly level, so I can better understand it and write my own toggle function in x86-64?

Upvotes: 3

Views: 5082

Answers (1)

Peter Cordes

Reputation: 364997

I was wondering if there was a good way to toggle a bit at the ith position in x86-64 assembly?

Yes, x86's BTC (Bit Test and Complement) instruction does exactly that (as well as setting CF to the old value of the bit), and runs efficiently on all modern CPUs.

  • Intel SnB-family: 1 uop, 1c latency, 2 per clock throughput (Nehalem and earlier: 1 per clock).
  • Silvermont/KNL: 1 uop, 1c latency, 1 per clock throughput.
  • AMD Ryzen: 2 uops, 2c latency, 2 per clock throughput.
  • AMD Bulldozer-family / Jaguar: 2 uops, 2c latency, 1 per clock throughput.
  • AMD K8/K10: 2 uops, 2c latency, 1 per clock throughput.

Source: Agner Fog's instruction tables and x86 optimization guide. See also other performance links in the tag wiki.

toggle:
    mov  rax, rdi        ; return value = num
    btc  rax, rsi        ; complement bit (rsi & 63); old bit value -> CF
    ret

(If you'd written toggle correctly in C).

Don't use btc with a memory operand: the bit-string instructions have insane CISC semantics where the bit index isn't limited to within the dword/qword selected by the addressing mode. (So btc m,r is 10 uops with one per 5c throughput on Skylake.) But with a register destination, the bit index is masked to the operand size, exactly like variable-count shifts mask their count.

Unfortunately gcc and clang miss this peephole optimization, even with -march=haswell or -mtune=intel. It's worth using even on AMD, but it's even more efficient on Intel.
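If you want btc from C today without writing the whole function in asm, you can force it with GNU extended inline asm. A minimal sketch, assuming GCC/clang with the default AT&T syntax; the helper name and constraints are mine, not anything the compilers emit on their own:

#include <stdint.h>

// Sketch (my helper, not a compiler intrinsic): force btc via GNU
// extended inline asm, since gcc/clang miss the peephole from pure C.
static inline uint64_t toggle_btc(uint64_t num, uint64_t bit)
{
    __asm__("btc %1, %0"
            : "+r"(num)   // read-write destination register
            : "r"(bit)    // bit index; btc masks it with & 63
            : "cc");      // btc writes CF (the old bit value)
    return num;
}

The "cc" clobber matters because btc sets CF to the old value of the bit.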


Repeated use of the same 1ULL << bit with multiple inputs

On AMD CPUs where btc is slower than xor, it's worth generating the mask in a register and using xor. Or even on Intel CPUs, this is worth it to toggle a bit in memory. (memory-destination xor is much better than memory-destination btc).
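As a scalar sketch of that idea (my example, not from the original answer): hoist the mask out, so each toggle is a single xor, including memory-destination xor for values kept in memory:

#include <stdint.h>

// Sketch: generate 1ULL<<bit once, then reuse it; each ^= can compile
// to a single xor (memory-destination for the pointed-to values).
void toggle_same_bit(uint64_t *a, uint64_t *b, uint64_t *c, unsigned bit)
{
    uint64_t mask = 1ULL << bit;   // one shift, done once
    *a ^= mask;
    *b ^= mask;
    *c ^= mask;
}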

For multiple elements in an array, use SSE2 pxor. You can generate the mask with:

pcmpeqd  xmm0, xmm0        ; all-ones: -1 in every element
psrlq    xmm0, 63          ; 1 in each qword element

movd     xmm1, esi         ; bit index, as the count for psllq
psllq    xmm0, xmm1        ; 1<<bit in each qword element


; then inside a loop, with data in xmm1 (the shift count is no longer needed)
pxor     xmm1, xmm0        ; flip the bit in each qword element
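The same thing written with SSE2 intrinsics, as a hedged sketch (the loop structure and helper name are mine, mirroring the asm above):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>

// Sketch: flip bit `bit` in every 64-bit element of buf, two at a time.
void toggle_bit_array(uint64_t *buf, size_t n, unsigned bit)
{
    __m128i ones = _mm_cmpeq_epi32(_mm_setzero_si128(), _mm_setzero_si128());
    __m128i one  = _mm_srli_epi64(ones, 63);                       // 1 in each qword
    __m128i mask = _mm_sll_epi64(one, _mm_cvtsi32_si128((int)bit)); // 1<<bit

    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, mask));
    }
    if (i < n)
        buf[i] ^= 1ULL << (bit & 63);   // scalar tail for an odd element count
}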

don't quite understand why some of the instructions are there.

All that crap is there because you compiled without optimization, and because you used a signed int constant.


It's not even worth looking at all the spill/reload to memory from the -O0 code. Compile with -O3 -march=native if you want code that doesn't suck.

See also How to remove "noise" from GCC/clang assembly output?, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for a good intro to looking at compiler-generated asm.


Using the signed int constant 1 << bit explains why gcc did a 32-bit shift and then cdqe. num ^= 1 << bit; is equivalent to

int mask = 1;
mask <<= bit;   // still signed int
num ^= mask;    // mask is sign-extended to 64-bit here.

In gcc -O3 output, we get

    mov     ecx, esi          # bit index; the variable shift count must be in cl
    mov     edx, 1
    sal     edx, cl           # 1<<bit   (32-bit)
    movsx   rax, edx          # sign-extend, like cdqe does for eax->rax
    xor     rax, rdi
    ret
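To see the consequence concretely: with bit = 60, as in the question's main, the 32-bit shift's count gets masked to 60 & 31 = 28 by the hardware, so the buggy version typically toggles bit 28 instead of bit 60. It's undefined behaviour in C, though, so this demo (mine) only shows what usually happens on x86:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    volatile uint64_t bit = 60;           // volatile: keep the shift at runtime
    uint64_t num = 100;
    uint64_t buggy = num ^ (1 << bit);    // UB: 32-bit int shifted by 60
    uint64_t fixed = num ^ (1ULL << bit); // well-defined 64-bit shift
    printf("buggy: %#llx\n", (unsigned long long)buggy); // often 0x10000064 (bit 28)
    printf("fixed: %#llx\n", (unsigned long long)fixed); // 0x1000000000000064
    return 0;
}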

If we write toggle correctly:

uint64_t toggle64(uint64_t num, uint32_t bit) {
  num ^= 1ULL << bit;
  return num;
}

(source+asm on the Godbolt compiler explorer)

gcc and clang still miss using btc, but it's not horrible. Interestingly, MSVC does spot the btc peephole, but wastes a MOV instruction:

toggle64 PROC
    mov      eax, edx         ; zero-extend bit: the wasted instruction
    btc      rcx, rax         ; complement bit (rax & 63) in num
    mov      rax, rcx         ; return value (Windows ABI: num arrives in rcx)
    ret      0

Using uint64_t bit avoids the extra MOV: the zero-extending mov eax, edx is unnecessary, because btc with a register destination masks the index with & 63, so high garbage in the register is not a problem. MSVC doesn't know this.
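For reference, that variant looks like this (the function name is mine):

#include <stdint.h>

// The signature change described above: with the index already 64-bit,
// MSVC can feed the incoming register straight to btc, with no
// zero-extending mov.
uint64_t toggle64_idx64(uint64_t num, uint64_t bit)
{
    return num ^ (1ULL << bit);
}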

gcc and clang emit code like you'd expect, except that gcc wastes a MOV instruction by generating 1ULL << bit in rdx and having to copy it to rax.

 ; clang output.
    mov     eax, 1
    mov     ecx, esi
    shl     rax, cl
    xor     rax, rdi
    ret

Upvotes: 4
