Reputation: 137
So I have seen the questions like toggle a bit at ith positon and How do you set, clear, and toggle a single bit?, but I was wondering if there was a good way to toggle a bit in the ith position in x86-64 assembly?
I tried writing it in C and looking through the assembly and don't quite understand exactly why there are some things that are there.
C:
unsigned long toggle(unsigned long num, unsigned long bit)
{
num ^= 1 << bit;
return num;
}
int main()
{
printf("%ld\n", toggle(100, 60));
return 0;
}
Toggle function assembly from GDB:
<toggle>
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-0x8],rdi
mov QWORD PTR [rbp-0x10],rsi
mov rax, QWORD PTR [rbp-0x10]
mov edx, 0x1
mov ecx, eax
shl edx, cl
mov eax, edx
cdqe
xor QWORD PTR [rbp-0x8],rax
mov rax, QWORD PTR [rbp-0x8]
pop rbp
ret
Can someone walk me through what's going on on the assembly level so I can better understand this and write my own toggle function in x86-64?
Upvotes: 3
Views: 5082
Reputation: 364997
I was wondering if there was a good way to toggle a bit in the ith position in x86-64 assembly?
Yes, x86's BTC
(Bit Test and Complement) instruction does exactly that (as well as setting CF to the old value of the bit), and runs efficiently on all modern CPUs.
Source: Agner Fog's instruction tables and x86 optimization guide. See also other performance links in the x86 tag wiki.
toggle:
mov rax, rdi
btc rax, rsi
ret
(If you'd written toggle
correctly in C).
Don't use btc
with a memory operand: the bit-string instructions have insane CISC semantics where the bit-index isn't limited to within the dword selected by the addressing mode. (So btc m,r
is 10 uops with one per 5c throughput on Skylake). But with a register operand, the shift-count is masked exactly like variable-count shifts.
Unfortunately gcc and clang miss this peephole optimization, even with -march=haswell
or -mtune=intel
. It's worth using even on AMD, but it's even more efficient on Intel.
1ULL << bit
with multiple inputsOn AMD CPUs where btc
is slower than xor
, it's worth generating the mask in a register and using xor
. Or even on Intel CPUs, this is worth it to toggle a bit in memory. (memory-destination xor
is much better than memory-destination btc
).
For multiple elements in an array, use SSE2 pxor
. You can generate the mask with:
pcmpeqd xmm0, xmm0 ; -1 all bits set
psrlq xmm0, 63 ; 1 just a single bit set
movd xmm1, esi
psllq xmm0, xmm1 ; 1<<bit
; then inside a loop, with data in xmm1
pxor xmm1, xmm0 ; flip bit in each qword element
don't quite understand exactly why there are some things that are there.
All that crap is there because you compiled without optimization, and because you used a signed int
constant.
It's not even worth looking at all the spill/reload to memory from the -O0
code. Compile with -O3 -march=native
if you want code that doesn't suck.
See also How to remove "noise" from GCC/clang assembly output?, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for a good intro to looking at compiler-generated asm.
Using the signed int
constant 1 << bit
explains why gcc did a 32-bit shift and then cdqe
. num ^= 1 << bit;
is equivalent to
int mask = 1;
mask <<= bit; // still signed int
num ^= mask; // mask is sign-extended to 64-bit here.
In gcc -O3 output, we get
mov edx, 1
sal edx, cl # 1<<bit (32-bit)
movsx rax, edx # sign-extend, like cdqe does for eax->rax
xor rax, rdi
If we write toggle
corectly:
uint64_t toggle64(uint64_t num, uint32_t bit) {
num ^= 1ULL << bit;
return num;
}
(source+asm on the Godbolt compiler explorer)
gcc and clang still miss using btc
, but it's not horrible. Interestingly, MSVC does spot the btc
peephole, but wastes a MOV instruction:
toggle64 PROC
mov eax, edx
btc rcx, rax
mov rax, rcx
ret 0
Using uint64_t
bit avoids the extra MOV. It's unnecessary because btc
with a register destination masks the index with & 63
. High garbage is not a problem, but MSVC doesn't know this.
gcc and clang emit code like you'd expect, but with gcc wasting a MOV instruction by generating 1ULL <<bit
in rdx
and having to copy to rax
.
; clang output.
mov eax, 1
mov ecx, esi
shl rax, cl
xor rax, rdi
ret
Upvotes: 4