Reputation: 247
I have a byte-sized number in register AL (8 bits), in 32-bit x86.
I need to exchange bit 1 (the second from the right) and bit 4 (the fifth from the right) of the number in register AL.
For example
00010000B input
00000010B output
Upvotes: 4
Views: 2379
Reputation: 20057
One instruction answer (needs 256-byte LUT).
xlatb // needs ds:(e)bx to point to the LUT
One to two instructions answer.
(If you need to zero-extend al, prefer movzx to a different register if you can trash a temp reg.)
movzx eax, al // mask off extra bits -- perhaps not needed,
mov al, look_up_table[eax] // if upper bits are guaranteed to be zero
Three instructions using al alone.
test al, 0x12
jpe skip ;; parity was even (aka the bits are the same)
xor al, 0x12 ;; toggle both bits
skip:
Theory of operation: bits need to be exchanged only when they differ. 0 changes to 1 and 1 changes to 0 by xoring them with 1. Both bits are affected at the same time.
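In C, the same conditional-toggle idea looks like this (a sketch; the function name is mine, not from the answer):

```c
#include <stdint.h>

/* Swap bits 1 and 4 of a byte by toggling both with XOR, but only
 * when they differ; when they are equal, a swap is a no-op anyway. */
static uint8_t swap_bits_1_4(uint8_t a)
{
    uint8_t both = a & 0x12;          /* isolate bits 1 and 4 */
    if (both != 0x00 && both != 0x12) /* exactly one set: the bits differ */
        a ^= 0x12;                    /* toggle both bits at once */
    return a;
}
```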
The jump could be avoided using a cmovpo or cmovpe conditional move if you can assume a P6-compatible CPU, like all modern x86. But in this case the sequence requires 4 instructions (or maybe only 3 if some register is known to contain a zero or the bit mask). cmov isn't available in 8-bit operand-size, so we use movzx to avoid partial-register problems and enable mov-elimination on Intel CPUs.
movzx edx, al
xor edx, 0x12 ; toggle the copy
test al, 0x12 ; test the original
cmovpo eax, edx ; use the flipped version if exactly 1 of 2 bits were set
PF is only set according to the low 8 bits of the result, so unless you want to horizontally XOR a wider integer down to 8 bits, starting with something like mov edx, eax / shr edx, 16 / xor edx, eax, you can't use PF this way.
Alternatively one could opt for testing if a & mask has only one bit set. That is done with the expression a == (a & -a) (like BMI1 blsi), or with (a & (a-1)) == 0 (like BMI1 blsr). But both of those are true for a == 0 as well as for a having exactly 1 bit set.
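Written out in C (a sketch with my own helper names), both tests indeed accept zero as well as a power of two:

```c
#include <stdint.h>
#include <stdbool.h>

/* blsi-style test: x equals its own lowest set bit (or both are zero). */
static bool one_or_zero_bits_blsi(uint32_t x) { return x == (x & -x); }

/* blsr-style test: clearing the lowest set bit leaves zero. */
static bool one_or_zero_bits_blsr(uint32_t x) { return (x & (x - 1)) == 0; }
```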
If you actually have BMI1 blsr, it sets CF if the input was zero, letting you distinguish zero bits set from 1 bit set.
pext rcx, rax, rbx ; rax = input, rbx = mask with 2 bits
xor rbx, rax ; toggled version (destroys mask)
test ecx, ecx ; compute parity flag from the 8 bottom bits
cmovpo rax, rbx ; use toggled version if only 1 bit was set
Here pext extracts (compresses) the two bits selected by the mask rbx from the source rax into the (two) least significant bits of the destination register rcx. Since the bits to be tested are now in the low byte, they can be tested to set/clear PF.
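For reference, pext's extract/compress behavior can be modeled in portable C (a simplified software model, nothing like the single-instruction hardware version):

```c
#include <stdint.h>

/* Software model of BMI2 pext: gather the bits of src selected by mask
 * into the least significant bits of the result, preserving order. */
static uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out = 1; mask != 0; out <<= 1) {
        uint64_t lowest = mask & -mask; /* isolate lowest set bit of mask */
        if (src & lowest)
            result |= out;
        mask &= mask - 1;               /* clear that mask bit (blsr) */
    }
    return result;
}
```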
Upvotes: 8
Reputation: 365832
Aki's answer using PF for test/cmovpo (odd parity) is probably best when the bits are in the low byte. Without that it's trickier. The general strategy of checking whether both bits are the same or not is probably good, though, since x86 doesn't have good bitfield extract/insert instructions.
We can shift a copy so the bits line up, then XOR and isolate that bit: (a ^ (a >> distance)) & mask. That requires you to know the distance between the set bits in the mask, which is fine for known-constant masks but not for runtime-variable masks.
mov ecx, eax ; alternative: rorx ecx, eax, 13 BMI2 saves a mov
shr ecx, 13 ; tzcnt(0x10000) - tzcnt(0x00008)
xor ecx, eax ; diff the 2 bits we care about, garbage everywhere else
mov edx, eax
xor edx, 0x10008 ; toggled copy
test cl, 0x8 ; (or EDX if the low bit isn't in the low 8)
cmovne eax, edx ; take the flipped version if the bits weren't equal
This is 7 instructions, or 6 with BMI2 rorx to copy-and-rotate. But it has good instruction-level parallelism and no partial-register stalls or false dependencies. It also avoids rcl/rcr, which are slow, and very slow with a count other than 1.
ror with an immediate count is single-uop on modern Intel and AMD, but a count of 1 costs 2 uops on Intel because of its weird partial-FLAGS semantics, updating OF and leaving the others unmodified. See https://uops.info/ and https://agner.org/optimize/
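The copy/shift/XOR strategy generalizes to any constant pair of bit positions; here is a C sketch (the helper name is mine):

```c
#include <stdint.h>

/* Swap the bits at positions hi and lo (hi > lo) of a:
 * shift a copy so the bits line up, XOR to get their difference,
 * then use that difference to toggle both bits, or neither. */
static uint32_t swap_two_bits(uint32_t a, unsigned hi, unsigned lo)
{
    uint32_t diff = ((a >> (hi - lo)) ^ a) & (1u << lo); /* nonzero iff the bits differ */
    return a ^ (diff | (diff << (hi - lo)));
}
```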
Another possible approach is checking for a single bit being set in a & mask. popcnt(a & mask) == 1 can work if you have it; otherwise tmp ^= (tmp >> 16); and so on can set up for using PF. Although at that point it's just a variation on the version above, costing more instructions to mask before copying and shifting.
BMI1 blsr / blsi have possibilities: they set CF depending on the source being zero, and ZF as usual according to the result being zero. So they can distinguish a & mask == 0 from a & mask having exactly 1 bit set so it's zero after tmp & (tmp-1).
Unfortunately the straightforward way of using blsr would need a CMOV that copies on ZF=1 && CF=0. There isn't a condition for that; cmova checks CF=0 and ZF=0. Also, cmova is 2 uops on recent Intel CPUs, since it has 4 inputs: it reads CF and the rest of FLAGS (the SPAZO group), which are renamed separately, and that's what Intel since Broadwell does instead of partial-flag merge uops.
The 0x10008 could be another register.
mov ecx, eax
and ecx, 0x10008
mov edx, eax
xor edx, 0x10008 ; toggled copy
blsr ecx, ecx ; ZF=1 if the result is zero (0 or 1 bit set before), CF=1 if input was 0
cmovc edx, eax ; if CF=1, we always want the original value. Copy it so next cmov has both operands the same.
cmovz eax, edx ; copy if ZF=1
; There is no cmov?? that copies if ZF=1 && CF=0.
So EAX gets the toggled copy from EDX only if CF=0 (so we don't overwrite EDX) and ZF=1 (so we then copy to EAX).
This feels quite inefficient, but is the best I've thought of for runtime-variable 2-bit-set masks.
That brings this up to 7 uops, break-even with the shr/test way. (And on Intel Haswell and earlier, each cmov is 2 uops, so it's worse.) The only upside here is working for a runtime-variable mask that we have in another register.
Any other single instruction would still be break-even at best, but I'm not sure what else could even work. E.g. dec ecx affects ZF without touching CF, but not in a useful way here. cmc to flip CF wouldn't help: then we'd need to cmov on ZF=1 and CF=1, but cmovbe (also 2 uops) checks CF=1 or ZF=1.
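Stripped of the uop-count concerns, the decision that the blsr/cmov chain implements boils down to this C sketch (names mine):

```c
#include <stdint.h>

/* Runtime-variable mask with exactly two bits set: toggle both mask
 * bits only when exactly one of them is set in a. */
static uint32_t swap_runtime_mask(uint32_t a, uint32_t mask)
{
    uint32_t t = a & mask;
    int zf = (t & (t - 1)) == 0; /* blsr result zero: 0 or 1 bit was set */
    int cf = (t == 0);           /* blsr CF: input was zero */
    return (zf && !cf) ? a ^ mask : a;
}
```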
Upvotes: 1
Reputation: 4219
6 instructions using al alone
; al | cf
; 76543210 | x
ror al, 5 ; 43210765 | x
bt al, 4 ; 43210765 | 1
rcl al, 4 ; 07651432 | 1
bt al, 2 ; 07651432 | 4
rcr al, 3 ; 32407651 | 4
rol al, 4 ; 76513240 | x
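The rotate/bt sequence can be checked without assembling it by emulating AL and CF in C (a simplified model that tracks only those two pieces of state, ignoring all other flags):

```c
#include <stdint.h>

struct cpu { uint8_t al; int cf; };

static void ror8(struct cpu *c, int n) { c->al = (uint8_t)((c->al >> n) | (c->al << (8 - n))); }
static void rol8(struct cpu *c, int n) { c->al = (uint8_t)((c->al << n) | (c->al >> (8 - n))); }
static void bt8(struct cpu *c, int n)  { c->cf = (c->al >> n) & 1; }

static void rcl8(struct cpu *c, int n) {        /* 9-bit rotate left through CF */
    while (n--) {
        int msb = (c->al >> 7) & 1;
        c->al = (uint8_t)((c->al << 1) | (uint8_t)c->cf);
        c->cf = msb;
    }
}
static void rcr8(struct cpu *c, int n) {        /* 9-bit rotate right through CF */
    while (n--) {
        int lsb = c->al & 1;
        c->al = (uint8_t)((c->al >> 1) | (c->cf << 7));
        c->cf = lsb;
    }
}

/* The six-instruction sequence from the answer above. */
static uint8_t run_sequence(uint8_t al) {
    struct cpu c = { al, 0 };
    ror8(&c, 5); bt8(&c, 4); rcl8(&c, 4);
    bt8(&c, 2);  rcr8(&c, 3); rol8(&c, 4);
    return c.al;
}
```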
Upvotes: 1
Reputation: 86403
I'll add my six-instruction alternative:
xor bl, bl ; -> clear work register
btr ax, 1  ; -> bit 1 to carry, clear that bit in al
rcl bl, 3  ; -> old bit 1 lands at bitpos 2
btr ax, 4  ; -> bit 4 to carry, clear that bit in al
rcl bl, 2  ; -> old bit 4 lands at bitpos 1, old bit 1 moves up to bitpos 4
or al, bl  ; -> merge bits
Note: This is just an academic exercise. You probably don't want code that uses btr, because the bit-test instructions can be slow; at least they were the last time I tried to use them. Also: untested.
Needs the 386 instruction set.
Upvotes: 1
Reputation: 37242
Another alternative (7 instructions):
mov ebx,00010010b ;ebx = 00010010
and ebx,eax ;ebx = 000A00B0
lea ebx,[ebx+ebx*4] ;ebx = 0A0AB0B0
rol bl,3 ;ebx = AB0B00A0
and ebx,00010010b ;ebx = 000B00A0
and al,11101101b ;al = abc0ef0g
or al,bl ;al = abcBefAg
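The multiply-by-5 trick works because x*5 = x + (x << 2) places a second copy of each masked bit two positions higher without carries, after which a single left-rotate by 3 lines both copies up with the target positions. A C sketch of the sequence (function name mine):

```c
#include <stdint.h>

/* Swap bits 1 and 4 via the and/lea/rol/and/or sequence. */
static uint8_t swap_lea_rol(uint8_t al)
{
    uint8_t t = al & 0x12;              /* 000A00B0 */
    t = (uint8_t)(t * 5);               /* 0A0AB0B0: t + (t << 2) */
    t = (uint8_t)((t << 3) | (t >> 5)); /* rotate left 3: AB0B00A0 */
    t &= 0x12;                          /* 000B00A0 */
    return (uint8_t)((al & 0xED) | t);  /* clear old bits, merge */
}
```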
And another alternative (also 7 instructions):
ror al,1 ;ax = ????????.gabcAefB
ror ax,1 ;ax = B???????.?gabcAef
ror al,3 ;ax = B???????.Aef?gabc
rol ax,1 ;ax = ???????A.ef?gabcB
rol al,3 ;ax = ???????A.gabcBef?
ror ax,1 ;ax = ????????.AgabcBef
rol al,2 ;ax = ????????.abcBefAg
Upvotes: 1
Reputation: 373
You may try this:
mov BL, AL
mov BH, AL
and BL, 2h //keep only bit 1
and BH, 10h //keep only bit 4
shl BL, 3 //move bit 1 to the position of bit 4
shr BH, 3 //move bit 4 to the position of bit 1
and AL, 0EDh //clear bits 1 and 4
or AL, BL //set bit 4
or AL, BH //set bit 1
The AL register contains the result. If you also need the original data in register BX preserved, prepend the solution with
push BX
and append at the end
pop BX
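The same mask/shift/merge sequence in C, for completeness (sketch):

```c
#include <stdint.h>

/* Mask out each bit, shift it to the other position, then merge
 * into the value with both original bits cleared. */
static uint8_t swap_mask_shift(uint8_t al)
{
    uint8_t lo = (uint8_t)((al & 0x02) << 3); /* bit 1 -> bit 4 */
    uint8_t hi = (uint8_t)((al & 0x10) >> 3); /* bit 4 -> bit 1 */
    return (uint8_t)((al & 0xED) | lo | hi);  /* 0xED clears bits 1 and 4 */
}
```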
Upvotes: 5