Reputation: 155
I found that contrary to its binary / bi-state nature, x86 CPUs are very slow when processing binary manipulations instructions such as SHR, BT, BTR, ROL and something similar.
For example, I've read it from somewhere that bit shifting / rotate more than 1 positions is considered slow (with high-latency, performance penalty and those scary stuff). It's even worse when the operands are in memory (Aren't memory bi-state peripherals, too?)
shl eax,1 ;ok
shl eax,7 ;slow?
So what's making them slow? It's kind of ironic that binary machines like the CPUs are slow on bit manipulations when such operations are supposed to be natural. It gives the impression that a binary CPU is having a hard time shifting bits in place!
EDIT: Now after having a second look at SHL entry in the manual, it does involve some heavy microcode logics!
From Intel's vol.2 manual for shl
...
Operation
TemporaryCount = Count & 0x1F;
TemporaryDestination = Destination;
while(TemporaryCount != 0) {
if(Instruction == SAL || Instruction == SHL) {
CF = MSB(Destination);
Destination = Destination << 1;
}
//instruction is SAR or SHR
else {
CF = LSB(Destination);
if(Instruction == SAR) Destination = Destination / 2; //Signed divide, rounding toward negative infinity
//Instruction is SHR
else Destination = Destination / 2; //Unsigned divide
}
TemporaryCount = TemporaryCount - 1;
}
//Determine overflow
if(Count & 0x1F == 1) {
if(Instruction == SAL || Instruction == SHL) OF = MSB(Destination) ^ CF;
else if(Instruction == SAR) OF = 0;
//Instruction == SHR
else OF = MSB(TemporaryDestination);
}
else OF = Undefined;
Unbelievable to see that such a simple boolean algebra is turned into an implementation nightmare.
Upvotes: 2
Views: 494
Reputation: 3630
Just a note.
shl eax,1 ; opcode: d1 e0
shl eax,7 ; opcode: c1 e0 07
are actually different instructions with different opcodes which are handled potentially by different logic blocks of ALU. They use the same mnemonic in assembly and that can confuse, but from the viewpoint of CPU, they are different instructions with different opcodes and encodings.
Upvotes: 2
Reputation: 93044
That is just the pseudo-code for the instruction, specifying exactly what it does. The instruction is not actually implemented like this. In practice, all modern CPUs have barrel shifters or similar, allowing them to shift by arbitrary amounts in a single cycle. See for example Agner Fog's tables where it shows a latency of 1 for almost all bit-fiddling instructions.
A few bit-fiddling instructions are slower, here are some examples:
bt
, btr
, bts
, and btc
are slower when used with memory operands because of the (a) read-modify-write operation and (b) the bitstring indexing they dorcr
with a rotate amount of more than 1
is slow because this instruction is almost never needed and thus not optimisedpdep
and pext
are slightly slower on Intel and much slower on AMD, probably because their implementation is pretty involved and splitting the implementation up makes it easier.On old processors (say, an 8086), the CPU would take as many cycles as the shift amount was, doing one shift every cycle. This kind of implementation allows the ALU to be used for shifting without any extra hardware, reducing the number of gates needed for the processor. No modern CPU I know of has this performance behaviour.
Upvotes: 9