How is a 15-byte instruction transferred from memory to the CPU?

Question

Assume we are using an x86-64 machine. That means its general-purpose registers are 64 bits long, its data bus can handle 64 bits at a time, and its ALU can handle at most a 64-bit number (right?).

A simple instruction like

MOV $5, %eax

moves a 32-bit number through the 64-bit data bus into a CPU register.

I have read the following:

An x86-64 instruction may be at most 15 bytes in length.

My question is: how is this possible if the data bus is at most 64 bits wide? How can it handle an instruction with 120 bits? Does the CPU fetch it over multiple cycles?

My second question: are there special, wider registers to store all 120 bits?

Alexis Wilke · Accepted Answer

Instruction Encoding

A modern x86 instruction is built from the following fields (the numbers are the possible sizes in bytes):

Prefixes (0, 1, 2, 3, 4)
VEX (0, 2, 3)
OPCODE (1)
ModR/M (0, 1)
SIB (0,1)
DISP (0, 1, 2, 4)
IMM (0, 1, 2, 4)
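
Adding up the largest value of every field in this list gives more than 15 bytes, so the 15-byte figure is a separate cap that the processor enforces, not the sum of the parts. A minimal Python sketch of that arithmetic, using the field sizes as listed above (the names MAX_FIELD_BYTES, naive_max_length and is_encodable are made up for this illustration):

```python
# Largest size of each field, in bytes, taken from the list above.
MAX_FIELD_BYTES = {
    "prefixes": 4,
    "vex": 3,
    "opcode": 1,
    "modrm": 1,
    "sib": 1,
    "disp": 4,
    "imm": 4,
}

# Architectural limit on the total length of a single instruction.
MAX_INSTRUCTION_BYTES = 15


def naive_max_length(fields=MAX_FIELD_BYTES):
    """Sum the per-field maxima, ignoring which combinations are legal."""
    return sum(fields.values())


def is_encodable(length):
    """True if a total length fits within the architectural limit."""
    return 1 <= length <= MAX_INSTRUCTION_BYTES


if __name__ == "__main__":
    total = naive_max_length()
    print(total, is_encodable(total))  # prints: 18 False
```

Not every field can take its largest size in the same instruction, and anything longer than 15 bytes is rejected by the processor.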

There can be zero to four prefix bytes, at most one from each group:

Group 1: LOCK or REP
Group 2: Segment overrides (CS, SS, DS, ES, FS, GS; not all are available in 64-bit mode) and branch hints (i.e. is a branch more likely to be taken or not?)
Group 3: Operand size (66H; it is mandatory for some instructions!)
Group 4: Address size (67H)
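
As a rough illustration of the four groups, here is a small Python sketch that peels the legacy prefix bytes off the front of an instruction. The byte values are the standard legacy prefix encodings. The table name PREFIX_GROUPS and the function split_prefixes are hypothetical, and the sketch deliberately ignores REX and VEX.

```python
# Legacy prefix bytes mapped to their group number. The branch hints reuse
# the CS (2Eh) and DS (3Eh) override bytes in front of a conditional jump.
PREFIX_GROUPS = {
    0xF0: 1,  # LOCK
    0xF2: 1,  # REPNE/REPNZ
    0xF3: 1,  # REP/REPE/REPZ
    0x2E: 2,  # CS override / "branch not taken" hint
    0x36: 2,  # SS override
    0x3E: 2,  # DS override / "branch taken" hint
    0x26: 2,  # ES override
    0x64: 2,  # FS override
    0x65: 2,  # GS override
    0x66: 3,  # operand-size override
    0x67: 4,  # address-size override
}


def split_prefixes(code: bytes):
    """Return ([(prefix_byte, group), ...], remaining_bytes)."""
    prefixes = []
    i = 0
    while i < len(code) and code[i] in PREFIX_GROUPS:
        prefixes.append((code[i], PREFIX_GROUPS[code[i]]))
        i += 1
    return prefixes, code[i:]


if __name__ == "__main__":
    # 66 B8 05 00 is MOV AX, 5 when the default operand size is 32 bits.
    print(split_prefixes(bytes([0x66, 0xB8, 0x05, 0x00])))
```

A real decoder also has to handle REX and VEX, which this sketch leaves out.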

VEX

VEX is (mostly) for the AVX extension.

OPCODE

OPCODE is the actual instruction: only 8 bits if you do not count the VEX and some other prefixes/special bytes such as the famous 0F. (On 8086 and 80186 processors the code 0F represented POP CS; it was later repurposed for extended OPCODEs.)

ModR/M defines the mode

It tells us which register and/or memory addressing mode is used with this instruction. Some instructions do not support all the available modes.

Scale, Index, Base

SIB is an extension to the ModR/M.

Displacement

DISP is the displacement, an immediate added to an address register (as in [ESP+13]). It can also be the direct address of a memory location.

Immediate

IMM is an immediate value (in MOV EBX, 8 the 8 is the immediate value loaded into EBX).

Note that IMM is generally limited to 32 bits. With a REX.W prefix, MOV can load a full 64-bit immediate into a register, but that form is not available with other instructions. Everywhere else, a 64-bit constant has to come from a register or from memory. One way of loading it from memory is to use a RIP-relative address (something like this: MOV R8, [RIP-42]). I've also noticed that in the past compilers such as gcc did not use that form. With 64-bit processors, though, a 32-bit displacement is available, so the value can be pretty much anywhere within ±2 GiB.
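
To make the sizes concrete, here is a hedged Python sketch that builds the bytes for the two MOV-immediate forms: opcode B8+r with a 32-bit immediate, and the same opcode behind a REX.W prefix (48h) with a 64-bit immediate. The function names are invented for this example, and it only handles the eight classic registers (0 = EAX/RAX, 3 = EBX/RBX, and so on).

```python
import struct


def encode_mov_r32_imm32(reg: int, value: int) -> bytes:
    """Encode MOV r32, imm32 for registers 0-7 (EAX=0 ... EDI=7)."""
    if not 0 <= reg <= 7:
        raise ValueError("this sketch only handles registers 0-7")
    # One opcode byte (B8 + register number), then a little-endian imm32.
    return bytes([0xB8 + reg]) + struct.pack("<I", value & 0xFFFFFFFF)


def encode_mov_r64_imm64(reg: int, value: int) -> bytes:
    """Encode MOV r64, imm64 for registers 0-7 (RAX=0 ... RDI=7)."""
    if not 0 <= reg <= 7:
        raise ValueError("this sketch only handles registers 0-7")
    rex_w = 0x48  # REX prefix with W=1 selects a 64-bit operand size
    imm = struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)
    return bytes([rex_w, 0xB8 + reg]) + imm


if __name__ == "__main__":
    print(encode_mov_r32_imm32(0, 5).hex())                   # b805000000
    print(encode_mov_r64_imm64(0, 0x1122334455667788).hex())  # 48b88877665544332211
```

The MOV $5, %eax from the question is therefore 5 bytes long, and the 64-bit immediate form is 10 bytes, both well under the 15-byte limit.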

Loading of Instructions

The 64-bit processors load instructions into the instruction cache and fetch them from there in blocks, typically 16 bytes at a time (it may vary depending on the processor), so an instruction does not have to arrive in a single 64-bit transfer. The processor then decodes those bytes. Depending on the processor, it may convert those bytes to a set of RISC-like micro-operations or just execute the instructions directly.
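
The following Python sketch shows what fetching in blocks means for a long instruction. It assumes aligned 16-byte fetch blocks, as in the paragraph above; real front ends differ in block size and buffering. The function name fetch_blocks is made up for this example.

```python
def fetch_blocks(address: int, length: int, block_size: int = 16):
    """Return the start addresses of the aligned blocks an instruction touches."""
    if not 1 <= length <= 15:
        raise ValueError("x86-64 instructions are 1 to 15 bytes long")
    first = address - (address % block_size)
    last_byte = address + length - 1
    last = last_byte - (last_byte % block_size)
    return list(range(first, last + 1, block_size))


if __name__ == "__main__":
    # A 15-byte instruction starting on a block boundary fits in one block.
    print([hex(a) for a in fetch_blocks(0x1000, 15)])  # ['0x1000']
    # The same instruction starting at offset 10 straddles two blocks.
    print([hex(a) for a in fetch_blocks(0x100A, 15)])  # ['0x1000', '0x1010']
```

Under this assumption an instruction never needs more than two blocks, because 15 bytes is less than the block size.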

For example, the LOOP label instruction is really the near equivalent of at least two instructions:

SUB ECX, 1
JNZ label

Some processors had a hard time with this in the past, so LOOP was very slow. One reason is that SUB changes many of the EFLAGS bits whereas LOOP changes none.

The decoder does not load an instruction into a register. The instruction is loaded into the CPU and handled in the corresponding unit (ALU, ACU, FPU, etc.). There is the RIP register that points to the current instruction, though. As far as you're concerned, RIP always points either at the start of the current instruction or at the start of the next instruction.

How it is really implemented, I do not know. They probably determine very quickly which unit is concerned and push the instruction there. The size is not that complicated to determine, so they can quickly get all the bytes and push them onto the concerned unit's FIFO, probably as a 15- or 16-byte value (i.e. one item in the FIFO is most certainly always 16 bytes; one byte may be ignored, which means the hardware does not even have lines to read it!). Those bytes would be positioned at the same location each time, so if the input does not have a LOCK or REP, it would put, say, 00h in that FIFO byte.
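
The flag difference can be shown with a toy Python model. It tracks only ECX and the zero flag, which is a simplification (SUB also updates other flags), and the function names and the state dictionary are hypothetical.

```python
def loop_step(state):
    """Model LOOP: decrement ECX and branch if it is non-zero; flags untouched."""
    state["ecx"] = (state["ecx"] - 1) & 0xFFFFFFFF
    return state["ecx"] != 0  # True means "jump to label"


def sub_jnz_step(state):
    """Model SUB ECX, 1 followed by JNZ: same branch, but ZF is overwritten."""
    state["ecx"] = (state["ecx"] - 1) & 0xFFFFFFFF
    state["zf"] = state["ecx"] == 0  # SUB writes ZF (other flags not modeled)
    return not state["zf"]  # JNZ jumps when ZF is clear


if __name__ == "__main__":
    a = {"ecx": 1, "zf": False}
    b = {"ecx": 1, "zf": False}
    print(loop_step(a), a["zf"])     # False False: ZF keeps its old value
    print(sub_jnz_step(b), b["zf"])  # False True: ZF was changed by SUB
```

Both versions take the same branch, but only the SUB/JNZ pair leaves a different flag state behind, which is why the two are only near equivalents.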

Note that moving 16 bytes in a FIFO between units is nothing. GPUs have been moving much larger amounts of data for years in their FIFOs.

You could say that these FIFOs are additional registers. The register file is the same thing as a FIFO, only it has random access instead of a "PUSH/POP" type of mechanism. Both use similar technology, i.e. memory, to keep their data.

Documentation

I would suggest the first document, currently titled:

Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4

It comes from Intel and is a good read about the available instructions (not absolutely everything, but more than enough to get started!).