Reputation: 63
Suppose I'm given the following assembly code:
Mul b,b,b
Which means the square of b is first calculated and then stored back in b. My question is: when the processor tries to fetch the value of b from memory, will that fetch happen twice or just once, given that both source operands refer to the same variable?
Upvotes: 0
Views: 247
Reputation: 364210
3-operand memory-to-memory machines only exist in theoretical computer science, AFAIK. These days everyone builds register machines, or (in low-end microcontrollers) accumulator machines (usually with one or two pointer registers as well as the actual accumulator), because it's much more efficient to have registers than to need a memory (or cache) store/reload round trip for every step in a chain of calculations.
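To make that concrete, here's roughly what your b = b * b looks like on a register machine like x86 (NASM-style syntax; the label b and the 32-bit operand size are just assumptions for this sketch):

    mov  eax, [b]       ; load b from memory once
    imul eax, eax       ; square it; both multiplier inputs are the same register
    mov  [b], eax       ; store the result back to b

On a machine like this the question doesn't even come up: the memory operand is encoded, and loaded, exactly once, and the squaring itself only touches a register.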
However, yes, it would be possible (and a good idea) to design a CPU that optimizes this case, doing only one cache read when multiple source operands encode the same address.
I need to find the program size in bytes. So I was just wondering if the value of b will be accessed twice?
These 2 things are unrelated. The machine code still has to encode b twice, unless there's a special "square" instruction that only has room for one source operand. In that case you'd definitely expect it to be accessed only once. (It might not have a separate mnemonic, and just be a different opcode for mul that the assembler can use when both source operands are the same.)

Or maybe the machine encoding lets the 2nd source operand explicitly reference the first source operand, instead of having to independently specify the address of b again. But the CPU could just decode "b, same_as_first" to "b, b" and then read b twice, i.e. only handle that special case in the decoders, instead of providing an optimized path in the operand-read stage for that case. Spending the extra transistors to implement that optimization would probably be worth it, but you can't assume anything, even in this special case where the instruction encoding has a "ditto" encoding for the second operand. And BTW, I'm totally making this up; I haven't heard of a real ISA like this. VAX has fully flexible encoding for both operands, where both can be memory, but AFAIK they can't reference each other.
Intel P6-family does do this optimization for register reads (instead of memory reads), which matters because it has limited read ports from its permanent / retirement register file.
x86 is a register architecture with mostly 2-operand instructions. Most instructions support a memory source or a memory destination (but not both in one instruction). But never mind that; the interesting analogy here is how P6 handles reading register source operands, which is similar to what you're wondering about for memory source operands in your 3-operand memory-to-memory architecture.
The Intel P6 microarchitecture is a 3-wide out-of-order design with register renaming. Most "simple" x86 instructions decode to a single internal uop, which is what it actually renames and tracks in the out-of-order core. (Pentium Pro / Pentium II are the original P6 microarchitecture. Later members of the P6 family, Pentium III and Pentium M are 3-wide, while Core2 and Nehalem are 4-wide.)
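As a rough illustration of what "simple" means here (uop counts from my reading of Agner Fog's instruction tables for the original P6, so treat them as approximate; mem is just a placeholder label):

    add eax, ecx        ; register-register ALU op: 1 uop
    add eax, [mem]      ; load + add: 2 uops on original P6 (no micro-fusion yet)
    add [mem], eax      ; load + add + store-address + store-data: 4 uops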
Sandybridge is a new microarchitecture family which switched to using a physical register file, and no longer has register-read stalls.
P6-family has a permanent register file that holds the retirement state of architectural registers. But the out-of-order machinery keeps register input values in the ReOrder Buffer. (Unlike designs with a physical register file, where the ROB has pointers to PRF entries, instead of the values directly).
If the register input to a uop comes from a uop that hasn't retired yet, the value is still "live" in the ROB. This is the normal case: most code rewrites the same registers repeatedly with new values, especially because 32-bit x86 only has 8 integer registers. And most x86 instructions are 2-operand with a read/write destination, like add edx, ecx (i.e. edx += ecx).
But when renaming a group of uops that has inputs from registers that haven't been written recently (i.e. the uop that wrote that register has retired), the ROB-read stage (which follows the rename stage) has to read all needed "cold" register values into the ROB from the permanent register file.
See Agner Fog's microarch PDF, chapter: Pentium Pro / PII / PIII pipeline, section 6.5 ROB read for more details. In these first-gen P6 CPUs, the permanent register file only has 2 read ports, but 3 uops with up to 2 inputs each can read up to 6 registers total. If they're all cold, the ROB-read stage will take 3 total cycles for that issue group. But if the same cold register is read 6 times, there's no problem: the hardware notices overlap and only does one read.
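As a sketch of that worst case (made-up code, assuming every source register here is "cold", i.e. the uop that last wrote it has already retired):

    ; one issue group of 3 uops needing 6 distinct cold register reads:
    ; with only 2 permanent-register-file read ports, the ROB-read stage
    ; needs about 3 cycles for this group
    lea rax, [rbx + rcx]
    lea rdx, [rsi + rdi]
    lea r10, [r8 + r9]

(I'm using x86-64 register names here; on the 32-bit P6 CPUs the same idea applies with the 8 legacy registers.) If all three uops read the same cold register instead, one read from the permanent register file would cover the whole group.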
Some more examples: lea rax, [rdx + rcx*4] would consume 2 read ports if rdx and rcx haven't been written recently (so the values aren't still in-flight in the ReOrder Buffer). But lea rax, [rdx + rdx*4] would only consume 1 port.
I used LEA as an example to be more RISC-like, with a separate write-only destination. But the performance problem (of register-read stalls) is the same either way: an add would have to read both of its source registers too.
Other instructions (actually uops) that are renamed/issued in the same group of 3 or 4 uops can also share read ports if any of them read the same "cold" register. e.g. add eax, esi / add edx, esi being renamed in the same group only needs to read esi once. (eax might also be cold for the first add. And if a later uop in the same group read eax, the ROB-read stage obviously couldn't fetch that value from the permanent register file yet, so it just marks the first add uop to write its result into the input field of that later uop, or something like that.)
Of course, writing eax makes it "live" in the Re-Order Buffer until the instruction retires, which is why P6 can normally run fast even with only a couple read ports for not-recently-written registers. P6 was designed before x86-64 existed (Core2 was the first 64-bit-capable P6 member, and Nehalem introduced more register-read bandwidth). Having more registers in x86-64 makes it possible to keep more constants in registers, so you're more likely to be reading registers that haven't been written recently.
Sandybridge switched to a physical register file, which allowed the ROB to grow because each entry is much more compact: instead of needing a copy of every value as an input to each uop, multiple uops reading the same register point to the same PRF entry. Sandybridge also added AVX, which widened vector registers to 256 bits. Having room in each uop entry for two 256b inputs would be pretty crazy.
Upvotes: 4