John

Reputation: 23

How is a 15-byte instruction transferred from memory to the CPU?

Assuming we are using an x86-64 machine, its general-purpose registers are 64 bits long, its data bus can handle 64 bits at a time, and its ALU can handle at most a 64-bit number (right?).

Having a simple instruction like

MOV $5, %eax

moves a 32 bit number through the 64 bit data bus into the CPU register.

I have read the following:

An x86-64 instruction may be at most 15 bytes in length.

My question is: how is this possible if the data bus is at most 64 bits wide? How can it handle an instruction with 120 bits? Does the CPU fetch it over multiple cycles?

My second question is: are there special, wider registers to store all 120 bits?

Upvotes: 2

Views: 2212

Answers (2)

Alexis Wilke

Reputation: 20818

Instruction Encoding

A modern X86 instruction is built from the following:

  • Prefixes (0, 1, 2, 3, 4)
  • VEX (0, 2, 3)
  • OPCODE (1)
  • ModR/M (0, 1)
  • SIB (0, 1)
  • DISP (0, 1, 2, 4)
  • IMM (0, 1, 2, 4, 8)

There can be zero to four legacy prefix bytes, at most one from each group:

  • Group 1: LOCK or REP
  • Group 2: Segment overrides (CS, SS, DS, ES, FS, GS; not all are meaningful in 64-bit mode) and branch hints (i.e. is a branch more likely to be taken or not?)
  • Group 3: Operand size (66H; mandatory for some instructions!)
  • Group 4: Address size (67H)

VEX

VEX is (mostly) for the AVX extensions.

OPCODE

OPCODE is the actual instruction: only 8 bits if you do not count the VEX and some other prefixes/special bytes such as the famous 0F escape byte. (On 8086 & 80186 processors the code 0F represented POP CS, which was later repurposed for extended OPCODEs.)

ModR/M defines the mode

It tells us which register and/or memory addressing mode is used with this instruction. Some instructions do not support all the available modes.

Scale, Index, Base

SIB is an extension to the ModR/M.

Displacement

DISP is the displacement, a constant added to an address register (as in [ESP+13]). It can also be the direct address of a memory location.

Immediate

IMM is an immediate value: in MOV EBX, 8 the 8 is the immediate, the value loaded into EBX.

Note that IMM is generally limited to 32 bits, which is sign-extended when the operand size is 64 bits. The one instruction that takes a full 64-bit immediate is MOV to a register with a REX.W prefix (MOV R8, imm64, which is ten bytes long). For every other instruction, a 64-bit constant has to come from a register or from memory. One way of loading it from memory is to use an IP-relative address (something like this: MOV R8, [RIP - 42]). Also, I've noticed that in the past compilers such as gcc did not use that instruction. In 64-bit mode, though, a 32-bit displacement is available, so the value can be pretty much anywhere within ±2 GiB of the instruction.
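To make the field layout concrete, here is a small sketch that splits one hand-assembled instruction into the fields described above. The instruction and the `fields` table are my own illustration (not taken from the manual), so treat the bytes as an assumption and check them with an assembler before relying on them.

```python
# Hand-assembled example for 64-bit mode (an assumption; verify with an assembler):
#   lock add dword ptr fs:[eax + ebx*4 + 0x12345678], 0x9abcdef0
fields = [
    ("prefix group 1: LOCK",       bytes([0xF0])),
    ("prefix group 2: FS segment", bytes([0x64])),
    ("prefix group 4: addr size",  bytes([0x67])),
    ("OPCODE: ADD r/m32, imm32",   bytes([0x81])),
    ("ModR/M: mod=10 /0 rm=100",   bytes([0x84])),
    ("SIB: scale=4 EBX EAX",       bytes([0x98])),
    ("DISP (4 bytes)",             (0x12345678).to_bytes(4, "little")),
    ("IMM (4 bytes)",              (0x9ABCDEF0).to_bytes(4, "little")),
]

# Concatenate the fields in encoding order to get the machine code.
encoding = b"".join(raw for _, raw in fields)

for name, raw in fields:
    print(f"{name:<28} {raw.hex(' ')}")
print("total bytes:", len(encoding))
```

Even with three prefixes, a SIB byte, a 32-bit displacement and a 32-bit immediate, this instruction stays under the 15-byte limit.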

Loading of Instructions

64-bit processors fetch instructions through the instruction cache, 16 bytes at a time (this may vary depending on the processor). The processor then decodes those bytes. Depending on the processor, it may convert them into a set of RISC-like micro-operations or just execute the instructions directly.
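As a rough sketch of what "16 bytes at a time" means for a long instruction, the helper below counts how many aligned fetch blocks an instruction touches. The function name and the 16-byte block size are assumptions for illustration; real front-ends differ between processors.

```python
FETCH_BLOCK = 16  # bytes per fetch block; assumed here, it varies by processor


def fetch_blocks_spanned(start_addr: int, length: int, block: int = FETCH_BLOCK) -> int:
    """Number of aligned blocks that hold bytes of an instruction."""
    first = start_addr // block                 # block holding the first byte
    last = (start_addr + length - 1) // block   # block holding the last byte
    return last - first + 1


print(fetch_blocks_spanned(0x1000, 15))  # starts at a block boundary
print(fetch_blocks_spanned(0x1008, 15))  # starts in the middle of a block
```

With these assumptions a 15-byte instruction never needs more than two blocks, whatever its alignment.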

For example, the LOOP label instruction is really the near equivalent of at least two instructions:

SUB ECX, 1
JNZ label

Some processors had a hard time with this in the past, so LOOP was very slow. One reason is that SUB changes many of the EFLAGS bits whereas LOOP changes none.
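The flag difference can be modelled in a few lines. This is only a toy model: it tracks just ZF and CF out of all the EFLAGS bits, uses 32-bit ECX as in the snippet above, and the function names are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF


def loop_step(ecx, flags):
    """LOOP: decrement ECX and branch if it is non-zero; flags are untouched."""
    ecx = (ecx - 1) & MASK32
    return ecx, flags, ecx != 0


def sub_jnz_step(ecx, flags):
    """SUB ECX, 1 then JNZ: same counting, but SUB rewrites the flags."""
    result = (ecx - 1) & MASK32
    new_flags = dict(flags, ZF=(result == 0), CF=(ecx < 1))  # CF = borrow
    return result, new_flags, not new_flags["ZF"]


flags_before = {"ZF": False, "CF": True}
print(loop_step(1, flags_before))     # flags come back unchanged
print(sub_jnz_step(1, flags_before))  # ZF and CF are overwritten
```

Both versions count down and branch the same way; only the second one disturbs the flags, which is why the two are a "near" equivalent rather than an exact one.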

The processor does not load an instruction into a register. It is loaded into the CPU and handled in the corresponding unit (ALU, ACU, FPU, etc.). There is the RIP register that points to the current instruction, though. As far as you're concerned, RIP always points either at the start of the current instruction or at the start of the next instruction.

How it really is implemented, I do not know. They probably very quickly (instantaneously) determine which unit is concerned and push the instruction there. The size is not that complicated to determine, so they can quickly get all the bytes and push them onto the FIFO of the unit concerned, probably as a 15- or 16-byte value (i.e. one item in the FIFO is most certainly always 16 bytes; one byte may be ignored, which means the hardware does not even have lines to read it!). Those bytes would be positioned at the same location each time. So if the input does not have a LOCK or REP, it would put, say, 00h in that FIFO byte.

Note that moving 16 bytes in a FIFO between units is nothing. GPUs have been moving much larger amounts of data for years in their FIFOs.

You could say that these FIFOs are additional registers. The register file is the same thing as a FIFO, only it has random access instead of "PUSH/POP" type of a mechanism. Both use similar technologies, a.k.a. memory, to keep data in a FIFO and in a register.

Documentation

I would suggest the first document from Intel, currently titled:

Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4

It is a good read about the available instructions (not absolutely everything, but more than enough to get started!)

Upvotes: 1

Peter Cordes

Reputation: 365637

Instruction fetch is a separate datapath from data load/store. It's not done using 64-bit mov instructions. There is dedicated logic that handles fetching and decoding variable-length unaligned x86 instructions.

A single instruction can span a 4k page boundary, so its bytes can come from 2 discontiguous physical pages! The front-end has to be able to fetch instruction bytes and concatenate them in a buffer.
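A minimal sketch of the page-boundary case, assuming 4 KiB pages; the helper name and the example address are made up for illustration.

```python
PAGE_SIZE = 4096  # 4 KiB pages assumed


def split_at_page(addr: int, length: int, page: int = PAGE_SIZE):
    """Return (bytes in the first page, bytes in the following page)."""
    room = page - (addr % page)   # bytes left before the page boundary
    in_first = min(length, room)
    return in_first, length - in_first


# A 15-byte instruction starting 6 bytes before the end of a page.
print(split_at_page(0x1FFA, 15))
# The same instruction at the start of a page stays within it.
print(split_at_page(0x2000, 15))
```

When the second number is non-zero, the two parts live in different virtual pages, which may map to unrelated physical pages, so the fetch logic has to stitch them together.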

Even 8086 had a small instruction prefetch buffer, although that wasn't necessarily needed for decoding because on 8088 it was smaller than the longest instruction (not including prefixes.)


See David Kanter's Sandybridge writeup for a diagram of the front-end in Sandybridge (and Nehalem and Bulldozer). Also Agner Fog's microarch guide. See https://en.wikichip.org/wiki/amd/microarchitectures/zen#Decode for more about the front-end in recent AMD.

On P6 and SnB-family Intel CPUs, code fetch and predecode (to find insn boundaries) happens in 16-byte blocks, finding lengths for up to 6 instructions per cycle and consuming up to 16 bytes of x86 machine code per cycle. If an instruction runs past the end of a block, the predecoder keeps those bytes around until the next cycle.

Agner Fog's microarch pdf has some details about optimizing to avoid pre-decode bottlenecks; x86 decoding is hard. For example, an operand-size prefix changes the length of the rest of the instruction in some cases: a 66 prefix is the only difference between add eax, imm32 (5 bytes) and add ax, imm16 (66 + 3 bytes). The predecoders in Intel CPUs stall in this case, taking extra cycles to handle it.

(Alexis' answer claims that length-finding is easy. It is most certainly not easy with all the ISA extensions that have accumulated over the years, where a VEX prefix is an invalid encoding of another instruction, for example. And it gets that much harder when you're trying to do multiple instructions in parallel, because you have to consider multiple starting points for all instructions after the first one. Older CPUs used to be slow to decode prefixes, e.g. taking an extra cycle per prefix or even escape byte. But modern mainstream Intel (not low-power) can handle any number of prefixes with no penalty.)
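The length-changing 66 prefix can be shown with a toy length finder. It understands only opcode 05 (add eAX, imm) with an optional 66 prefix and assumes a 32-bit default operand size; a real predecoder handles the whole ISA and several instructions per cycle, so this is an illustration of the problem, not of how the hardware works.

```python
def toy_length(code: bytes, pos: int = 0) -> int:
    """Length in bytes of the instruction starting at pos.

    Toy decoder: handles only opcode 05 (add eAX, imm) with an optional
    66 operand-size prefix; anything else raises ValueError.
    """
    start = pos
    imm_size = 4                 # default operand size: 32 bits (assumed mode)
    if code[pos] == 0x66:        # the prefix changes the size of the immediate
        imm_size = 2
        pos += 1
    if code[pos] != 0x05:
        raise ValueError("opcode not handled by this toy decoder")
    pos += 1 + imm_size          # opcode byte plus immediate
    return pos - start


add_eax = bytes([0x05, 0x78, 0x56, 0x34, 0x12])  # add eax, 0x12345678
add_ax = bytes([0x66, 0x05, 0x34, 0x12])         # add ax, 0x1234

print(toy_length(add_eax), toy_length(add_ax))

# In a stream, the start of the second instruction depends on the first one.
stream = add_ax + add_eax
second = toy_length(stream, 0)
print(second, toy_length(stream, second))
```

The decoder cannot know where the immediate ends until it has seen the prefix, and it cannot know where the next instruction starts until it has the length of the current one; that dependency is what makes parallel length-finding expensive.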

Instructions are fed to the decoders up to 4 at a time (or 5 or 6 with macro-fusion). Depending on the uarch, this can produce up to 7 micro-ops (uops) (4-1-1-1 pattern on Core2/Nehalem), 4 (SnB-family before Skylake), or 5 (Skylake). SKL still only has 4 decoders, but allows them to produce up to 5 uops, e.g. for patterns like 2-1-1-1.

Decoding x86 instructions in parallel is such a bottleneck that modern CPUs (Intel since SnB-family, AMD since Zen) cache decoded uops to shortcut that for hot portions of code. Pentium 4's trace cache was an early experiment in that direction which worked out poorly (and it didn't have the decoder throughput to maintain acceptable performance on trace cache misses).

See also What's the relationship between early 90s Pentium microprocessor and today's Intel designs? on retrocomputing, where my answer talks some about why P4 was a CPU-architecture dead end, and how P6-family (PPro / PIII) evolved into Intel's current Sandybridge-family.


All x86-64 CPUs are new enough to be high performance with wide internal data paths, but 16 and 32-bit CPUs have the same 15-byte max length (including redundant prefixes). They would probably use a buffer at least big enough to hold an instruction not including prefixes, if they decode those separately before looking at the opcode, modrm + extra addressing mode bytes, and/or immediate.

Except for original 8086, where a 64k code segment full of REP prefixes for one instruction is valid. At that point Intel hadn't defined any limitations on instruction length, and 8086 decoded prefixes separately from the rest of the instruction.


Upvotes: 7