Reputation: 247
I have a byte-sized number in register AL (8 bits), in 32-bit x86.
I need to exchange bit 1 (the second from the right) and bit 4 (the fifth from the right) of the number in register AL.
For example
00010000B input
00000010B output
Upvotes: 4
Views: 2379
Reputation: 20057
One instruction answer (needs 256-byte LUT).
xlatb // needs ds:(e)bx to point to the LUT
One to two instructions answer.
(If you need to zero-extend al, prefer movzx to a different register if you can trash a temp reg.)
movzx eax, al // mask off extra bits -- perhaps not needed,
mov al, look_up_table[eax] // if upper bits are guaranteed to be zero
Three instructions using al alone.
test al, 0x12
jpe skip ;; parity was even (aka the bits are the same)
xor al, 0x12 ;; toggle both bits
skip:
Theory of operation: bits need to be exchanged only when they differ. 0 changes to 1 and 1 changes to 0 by xoring them with 1. Both bits are affected at the same time.
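In C, the same conditional-toggle idea looks like this (a sketch; the function name is mine, not from the answer):

```c
#include <stdint.h>

/* Swap bits 1 and 4 of a byte by toggling both with XOR, but only
 * when they differ; when they are equal, a swap is a no-op anyway. */
static uint8_t swap_bits_1_4(uint8_t a)
{
    uint8_t both = a & 0x12;          /* isolate bits 1 and 4 */
    if (both != 0x00 && both != 0x12) /* exactly one set: the bits differ */
        a ^= 0x12;                    /* toggle both bits at once */
    return a;
}
```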
The jump could be avoided using a cmovpo or cmovpe conditional move if you can assume a P6-compatible CPU, like all modern x86. But in this case the sequence requires 4 instructions (or maybe only 3 if some register is known to contain a zero or the bit mask). cmov isn't available in 8-bit operand-size, so we use movzx to avoid partial-register problems and enable mov-elimination on Intel CPUs.
movzx edx, al
xor edx, 0x12 ; toggle the copy
test al, 0x12 ; test the original
cmovpo eax, edx ; use the flipped version if exactly 1 of 2 bits were set
PF is only set according to the low 8 bits of the result, so unless you want to horizontally XOR a wider integer down to 8 bits, starting with something like mov edx, eax / shr edx, 16 / xor edx, eax, you can't use PF this way.
Alternatively one could opt for testing if a & mask has only one bit set. That is done with the expression a == (a & -a) (like BMI1 blsi), or with (a & (a-1)) == 0 (like BMI1 blsr). But both of those are true for a == 0 as well as for a having exactly 1 bit set.
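Written out in C (a sketch with my own helper names), both tests indeed accept zero as well as a power of two:

```c
#include <stdint.h>
#include <stdbool.h>

/* blsi-style test: x equals its own lowest set bit (or both are zero). */
static bool one_or_zero_bits_blsi(uint32_t x) { return x == (x & -x); }

/* blsr-style test: clearing the lowest set bit leaves zero. */
static bool one_or_zero_bits_blsr(uint32_t x) { return (x & (x - 1)) == 0; }
```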
If you actually have BMI1 blsr, it sets CF if the input was zero, letting you distinguish zero bits set from 1 bit set.
pext rcx, rax, rbx ; rax = input, rbx = mask with 2 bits
xor rbx, rax ; toggled version (destroys mask)
test ecx, ecx ; compute parity flag from the 8 bottom bits
cmovpo rax, rbx ; use toggled version if only 1 bit was set
Here pext extracts (compresses) the two bits selected by the mask rbx from the source rax into the (two) least significant bits of the destination register rcx. Since the bits to be tested are now in the low byte, they can be tested to set/clear PF.
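For reference, pext's extract/compress behavior can be modeled in portable C (a simplified software model, nothing like the single-instruction hardware version):

```c
#include <stdint.h>

/* Software model of BMI2 pext: gather the bits of src selected by mask
 * into the least significant bits of the result, preserving order. */
static uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out = 1; mask != 0; out <<= 1) {
        uint64_t lowest = mask & -mask; /* isolate lowest set bit of mask */
        if (src & lowest)
            result |= out;
        mask &= mask - 1;               /* clear that mask bit (blsr) */
    }
    return result;
}
```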
Upvotes: 8
Reputation: 365832
Aki's answer using PF for test/cmovpo (odd parity) is probably best when the bits are in the low byte. Without that it's trickier. The general strategy of checking whether both bits are the same or not is probably good, though, since x86 doesn't have good bitfield extract/insert instructions.
We can shift a copy so the bits line up, then XOR and isolate that bit: (a ^ (a >> distance)) & mask. That requires you to know the distance between the set bits in the mask, which is fine for known-constant masks but not for runtime-variable masks.
mov ecx, eax ; alternative: rorx ecx, eax, 13 BMI2 saves a mov
shr ecx, 13 ; tzcnt(0x10000) - tzcnt(0x00008)
xor ecx, eax ; diff the 2 bits we care about, garbage everywhere else
mov edx, eax
xor edx, 0x10008 ; toggled copy
test cl, 0x8 ; (or EDX if the low bit isn't in the low 8)
cmovne eax, edx ; take the flipped version if the bits weren't equal
This is 7 instructions, or 6 with BMI2 rorx to copy-and-rotate. But it has good instruction-level parallelism and no partial-register stalls or false dependencies. It also avoids rcl/rcr, which are slow, and very slow with a count other than 1.
ror with an immediate count is single-uop on modern Intel and AMD, but a count of 1 costs 2 uops on Intel because of its weird partial-FLAGS semantics, updating OF and leaving the others unmodified. See https://uops.info/ and https://agner.org/optimize/
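The copy/shift/XOR strategy generalizes to any constant pair of bit positions; here is a C sketch (the helper name is mine):

```c
#include <stdint.h>

/* Swap the bits at positions hi and lo (hi > lo) of a:
 * shift a copy so the bits line up, XOR to get their difference,
 * then use that difference to toggle both bits, or neither. */
static uint32_t swap_two_bits(uint32_t a, unsigned hi, unsigned lo)
{
    uint32_t diff = ((a >> (hi - lo)) ^ a) & (1u << lo); /* nonzero iff the bits differ */
    return a ^ (diff | (diff << (hi - lo)));
}
```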
Another possible approach is checking for a single bit being set in a & mask. popcnt(a & mask) == 1 can work if you have it; otherwise tmp ^= (tmp >> 16); and so on can set up for using PF. Although at that point it's just a variation on the version above, costing more instructions to mask before copying and shifting.
BMI1 blsr / blsi have possibilities: they set CF depending on the source being zero, and ZF as usual according to the result being zero. So they can distinguish a & mask == 0 from a & mask having exactly 1 bit set so it's zero after tmp & (tmp-1).
Unfortunately the straightforward way of using blsr would need a CMOV that copies on ZF=1 && CF=0. There isn't a condition for that; cmova checks CF=0 and ZF=0. Also, cmova is 2 uops on recent Intel CPUs, since it has 4 inputs: it reads CF and the rest of FLAGS (the SPAZO group), which are renamed separately, and that's what Intel since Broadwell does instead of partial-flag merge uops.
The 0x10008 could be another register.
mov ecx, eax
and ecx, 0x10008
mov edx, eax
xor edx, 0x10008 ; toggled copy
blsr ecx, ecx ; ZF=1 if the result is zero (0 or 1 bit set before), CF=1 if input was 0
cmovc edx, eax ; if CF=1, we always want the original value. Copy it so next cmov has both operands the same.
cmovz eax, edx ; copy if ZF=1
; There is no cmov?? that copies if ZF=1 && CF=0.
So EAX gets the toggled copy from EDX only if CF=0 (so we don't overwrite EDX) and ZF=1 (so we then copy to EAX).
This feels quite inefficient, but is the best I've thought of for runtime-variable 2-bit-set masks.
That brings this up to 7 uops, break-even with the shr/test way. (And on Intel Haswell and earlier, each cmov is 2 uops, so it's worse.) The only upside here is working for a runtime-variable mask that we have in another register.
Any other single instruction would still be break-even at best, but I'm not sure what else could even work. E.g. dec ecx affects ZF without touching CF, but not in a useful way here. cmc to flip CF wouldn't help: then we'd need to cmov on ZF=1 and CF=1, but cmovbe (also 2 uops) checks CF=1 or ZF=1.
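Stripped of the uop-count concerns, the decision that the blsr/cmov chain implements boils down to this C sketch (names mine):

```c
#include <stdint.h>

/* Runtime-variable mask with exactly two bits set: toggle both mask
 * bits only when exactly one of them is set in a. */
static uint32_t swap_runtime_mask(uint32_t a, uint32_t mask)
{
    uint32_t t = a & mask;
    int zf = (t & (t - 1)) == 0; /* blsr result zero: 0 or 1 bit was set */
    int cf = (t == 0);           /* blsr CF: input was zero */
    return (zf && !cf) ? a ^ mask : a;
}
```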
Upvotes: 1
Reputation: 4219
6 instructions using al alone
; al | cf
; 76543210 | x
ror al, 5 ; 43210765 | x
bt al, 4 ; 43210765 | 1
rcl al, 4 ; 07651432 | 1
bt al, 2 ; 07651432 | 4
rcr al, 3 ; 32407651 | 4
rol al, 4 ; 76513240 | x
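The rotate/bt sequence can be checked without assembling it by emulating AL and CF in C (a simplified model that tracks only those two pieces of state, ignoring all other flags):

```c
#include <stdint.h>

struct cpu { uint8_t al; int cf; };

static void ror8(struct cpu *c, int n) { c->al = (uint8_t)((c->al >> n) | (c->al << (8 - n))); }
static void rol8(struct cpu *c, int n) { c->al = (uint8_t)((c->al << n) | (c->al >> (8 - n))); }
static void bt8(struct cpu *c, int n)  { c->cf = (c->al >> n) & 1; }

static void rcl8(struct cpu *c, int n) {        /* 9-bit rotate left through CF */
    while (n--) {
        int msb = (c->al >> 7) & 1;
        c->al = (uint8_t)((c->al << 1) | (uint8_t)c->cf);
        c->cf = msb;
    }
}
static void rcr8(struct cpu *c, int n) {        /* 9-bit rotate right through CF */
    while (n--) {
        int lsb = c->al & 1;
        c->al = (uint8_t)((c->al >> 1) | (c->cf << 7));
        c->cf = lsb;
    }
}

/* The six-instruction sequence from the answer above. */
static uint8_t run_sequence(uint8_t al) {
    struct cpu c = { al, 0 };
    ror8(&c, 5); bt8(&c, 4); rcl8(&c, 4);
    bt8(&c, 2);  rcr8(&c, 3); rol8(&c, 4);
    return c.al;
}
```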
Upvotes: 1
Reputation: 86403
I'll add my six-instruction alternative:
xor bl, bl ; -> clear work register
btr ax, 1  ; -> bit 1 to carry, clear that bit in al
rcl bl, 3  ; -> old bit 1 lands at bitpos 2
btr ax, 4  ; -> bit 4 to carry, clear that bit in al
rcl bl, 2  ; -> old bit 4 lands at bitpos 1, old bit 1 moves up to bitpos 4
or al, bl  ; -> merge bits
Note: This is just an academic exercise. You probably don't want code that uses btr, because the bit-test instructions can be slow; at least they were the last time I tried to use them. Also: untested.
Needs the 386 instruction set.
Upvotes: 1
Reputation: 37242
Another alternative (7 instructions):
mov ebx,00010010b ;ebx = 00010010
and ebx,eax ;ebx = 000A00B0
lea ebx,[ebx+ebx*4] ;ebx = 0A0AB0B0
rol bl,3 ;ebx = AB0B00A0
and ebx,00010010b ;ebx = 000B00A0
and al,11101101b ;al = abc0ef0g
or al,bl ;al = abcBefAg
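The multiply-by-5 trick works because x*5 = x + (x << 2) places a second copy of each masked bit two positions higher without carries, after which a single left-rotate by 3 lines both copies up with the target positions. A C sketch of the sequence (function name mine):

```c
#include <stdint.h>

/* Swap bits 1 and 4 via the and/lea/rol/and/or sequence. */
static uint8_t swap_lea_rol(uint8_t al)
{
    uint8_t t = al & 0x12;              /* 000A00B0 */
    t = (uint8_t)(t * 5);               /* 0A0AB0B0: t + (t << 2) */
    t = (uint8_t)((t << 3) | (t >> 5)); /* rotate left 3: AB0B00A0 */
    t &= 0x12;                          /* 000B00A0 */
    return (uint8_t)((al & 0xED) | t);  /* clear old bits, merge */
}
```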
And another alternative (also 7 instructions):
ror al,1 ;ax = ????????.gabcAefB
ror ax,1 ;ax = B???????.?gabcAef
ror al,3 ;ax = B???????.Aef?gabc
rol ax,1 ;ax = ???????A.ef?gabcB
rol al,3 ;ax = ???????A.gabcBef?
ror ax,1 ;ax = ????????.AgabcBef
rol al,2 ;ax = ????????.abcBefAg
Upvotes: 1
Reputation: 373
You may try this:
mov BL, AL
mov BH, AL
and BL, 2h //keep only bit 1
and BH, 10h //keep only bit 4
shl BL, 3 //move bit 1 to the position of bit 4
shr BH, 3 //move bit 4 to the position of bit 1
and AL, 0EDh //clear bits 1 and 4
or AL, BL //set bit 4
or AL, BH //set bit 1
The AL register contains the result. If you also need the original data in register BX preserved, prepend the solution with
push BX
and append at the end
pop BX
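The same mask/shift/merge sequence in C, for completeness (sketch):

```c
#include <stdint.h>

/* Mask out each bit, shift it to the other position, then merge
 * into the value with both original bits cleared. */
static uint8_t swap_mask_shift(uint8_t al)
{
    uint8_t lo = (uint8_t)((al & 0x02) << 3); /* bit 1 -> bit 4 */
    uint8_t hi = (uint8_t)((al & 0x10) >> 3); /* bit 4 -> bit 1 */
    return (uint8_t)((al & 0xED) | lo | hi);  /* 0xED clears bits 1 and 4 */
}
```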
Upvotes: 5