If CPU is a binary machine, why is it slow on bit manipulations?

Question

I found that contrary to its binary / bi-state nature, x86 CPUs are very slow when processing binary manipulations instructions such as SHR, BT, BTR, ROL and something similar.

For example, I've read it from somewhere that bit shifting / rotate more than 1 positions is considered slow (with high-latency, performance penalty and those scary stuff). It's even worse when the operands are in memory (Aren't memory bi-state peripherals, too?)

shl eax,1  ;ok
shl eax,7  ;slow?

So what's making them slow? It's kind of ironic that binary machines like the CPUs are slow on bit manipulations when such operations are supposed to be natural. It gives the impression that a binary CPU is having a hard time shifting bits in place!

EDIT: Now after having a second look at SHL entry in the manual, it does involve some heavy microcode logics!

From Intel's vol.2 manual for shl...

Operation
TemporaryCount = Count & 0x1F;
TemporaryDestination = Destination;
while(TemporaryCount != 0) {
    if(Instruction == SAL || Instruction == SHL) {
        CF = MSB(Destination);
        Destination = Destination << 1;
    }
    //instruction is SAR or SHR
    else {
        CF = LSB(Destination);
        if(Instruction == SAR) Destination = Destination / 2; //Signed divide, rounding toward negative infinity
        //Instruction is SHR
        else Destination = Destination / 2; //Unsigned divide
    }
    TemporaryCount = TemporaryCount - 1;
}
//Determine overflow
if(Count & 0x1F == 1) {
    if(Instruction == SAL || Instruction == SHL) OF = MSB(Destination) ^ CF;
    else if(Instruction == SAR) OF = 0;
    //Instruction == SHR
    else OF = MSB(TemporaryDestination);
}
else OF = Undefined;

Unbelievable to see that such a simple boolean algebra is turned into an implementation nightmare.

fuz · Accepted Answer

That is just the pseudo-code for the instruction, specifying exactly what it does. The instruction is not actually implemented like this. In practice, all modern CPUs have barrel shifters or similar, allowing them to shift by arbitrary amounts in a single cycle. See for example Agner Fog's tables where it shows a latency of 1 for almost all bit-fiddling instructions.

A few bit-fiddling instructions are slower, here are some examples:

bt, btr, bts, and btc are slower when used with memory operands because of the (a) read-modify-write operation and (b) the bitstring indexing they do
rcr with a rotate amount of more than 1 is slow because this instruction is almost never needed and thus not optimised
pdepand pext are slightly slower on Intel and much slower on AMD, probably because their implementation is pretty involved and splitting the implementation up makes it easier.

On old processors (say, an 8086), the CPU would take as many cycles as the shift amount was, doing one shift every cycle. This kind of implementation allows the ALU to be used for shifting without any extra hardware, reducing the number of gates needed for the processor. No modern CPU I know of has this performance behaviour.

If CPU is a binary machine, why is it slow on bit manipulations?

Answers (2)

Related Questions