Reputation: 279
Does an x86 (or x86-64) processor optimise away the multiplication if one of the operands happens to be 1.0?
PS: I do not mean the compiler optimising away a multiplication by a constant 1.0.
Upvotes: 1
Views: 117
Reputation: 364200
That's not something I've seen mentioned in the docs on instruction latencies, or in microarchitecture write-ups for Intel or AMD CPUs.
I suspect it doesn't happen, because variable latency would interfere with pipelined execution units (multiple results coming out of the same execution unit in the same clock cycle would mean extra complexity). There are probably also other bits of logic (uop scheduling/queueing, result-forwarding networks) that are designed around every uop having a known latency, except for special cases like division/sqrt.
IIRC, one analyst (maybe Agner Fog or David Kanter) suggested that some uops could have been implemented with 2-cycle latency, but instead take 3 cycles to match the other uops their execution port handles. So constant latency appears to matter a great deal in Intel CPU designs, to the point that it was worth making an operation slower.
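If you want to check this empirically, here's a minimal sketch (my own, not from any vendor doc): it times a serial chain of multiplies whose multiplier comes from the command line, so the compiler can't fold `x * 1.0` away at compile time. If the CPU special-cased a 1.0 operand, the per-multiply time should differ between runs.

```c
/* Sketch: measure per-multiply latency of a serial FP dependency chain.
 * Assumes a POSIX clock_gettime(). The multiplier is taken from argv
 * so the compiler can't see that it might be 1.0. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    double m = (argc > 1) ? atof(argv[1]) : 1.0;  /* opaque to the compiler */
    double x = 1.2345;
    const long iters = 100000000L;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        x *= m;                 /* serial dependency chain: latency-bound */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* print x so the chain isn't optimised away as dead code */
    printf("x=%g  %.3f ns per multiply\n", x, secs * 1e9 / iters);
    return 0;
}
```

Compile with something like `gcc -O2 bench.c`, then compare `./a.out 1.0` against `./a.out 1.000001`; identical times are consistent with a fixed multiply latency.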
Note that we're only talking about latency here. If your multiply isn't part of a loop-carried dependency chain, or you have enough independent multiplies, you can keep the multiplier(s) going with one operation per clock.
Haswell CPUs can sustain a throughput of two FP vector multiplies per clock (256b vectors of 4 doubles or 8 floats), with a latency of 5 clock cycles until the result is ready, regardless of the inputs. Or one vector-integer multiply per clock. (The vector-integer multiply ALU is on port 0; the vector FP multipliers are on ports 0 and 1.)
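To make the latency/throughput distinction concrete, a sketch (function names are hypothetical): the single-accumulator product is bound by the 5-cycle multiply latency, while splitting the work across four independent accumulators lets the pipelined multipliers overlap chains and approach the throughput limit.

```c
#include <stddef.h>

/* One dependency chain: each multiply waits on the previous result,
 * so the loop runs at roughly one element per multiply latency. */
double prod_serial(const double *a, size_t n)
{
    double p = 1.0;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Four independent chains in flight at once (assumes n % 4 == 0),
 * so the pipelined multipliers can work on them in parallel. */
double prod_unrolled(const double *a, size_t n)
{
    double p0 = 1.0, p1 = 1.0, p2 = 1.0, p3 = 1.0;
    for (size_t i = 0; i < n; i += 4) {
        p0 *= a[i];
        p1 *= a[i + 1];
        p2 *= a[i + 2];
        p3 *= a[i + 3];
    }
    return (p0 * p1) * (p2 * p3);
}
```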
Avoid multiplying when you can; it leads to long dependency chains. (Usually this comes up for integer multiplies that calculate loop indices. Compilers do much better when you write your loop to increment the counter by 16, instead of multiplying i++ by 16 as an array index.)
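A hypothetical illustration of that last point (function names are my own; both assume the arrays hold n*16 floats):

```c
/* Index computed from i each iteration: the address depends on i * 16. */
void scale_mul(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i * 16] = src[i * 16] * 2.0f;
}

/* Running offset incremented by 16: no per-iteration index computation. */
void scale_add(float *dst, const float *src, int n)
{
    for (int j = 0; j < n * 16; j += 16)
        dst[j] = src[j] * 2.0f;
}
```

Modern compilers usually perform this strength reduction themselves, but writing the increment form directly guarantees it.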
Upvotes: 2