Reputation: 4769
I came across this very nice infographic which gives a rough estimate of the CPU cycles used for certain operations. While studying it I noticed an entry "Right branch of if", which I assumed is the branch "if" is going to take if the condition is met (EDIT: as pointed out in the comments, "Right" actually means "a rightly predicted branch"). It made me wonder if there's any (however minor) speed difference in the if branch compared to the else branch.
Compare, for example, the following very condensed code:
#include <cstdio>
volatile int a = 2;
int main()
{
    if (a > 5) {
        printf("a > 5!");
        a = 2;
    } else {
        printf("a <= 5!");
        a = 3;
    }
}
Which generates this x86-64 assembly:
.LC0:
        .string "a > 5!"
.LC1:
        .string "a <= 5!"
main:
        push    rcx
        mov     eax, DWORD PTR a[rip]
        cmp     eax, 5
        jle     .L2
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 2
        jmp     .L3
.L2:
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 3
.L3:
        xor     eax, eax
        pop     rdx
        ret
a:
        .long   2
As you can see, the right branch, which calls printf for "a > 5", lives closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?
Upvotes: 7
Views: 1520
Reputation: 50358
To put it shortly: no on mainstream processors, but yes on some old or embedded processors.
Modern mainstream processors are very good at predicting conditions (assuming the branch has been executed before or the test can be done early). When a branch is correctly predicted, it is almost free (besides the use of branch units on some specific execution ports). The processor can speculatively fetch and execute the instructions of the predicted conditional block. When the branch is mispredicted, the processor must perform a kind of rollback operation that is pretty expensive (typically 5-15 cycles).
Some embedded processors and older ones use static prediction algorithms. For example, they can assume that the branch is never taken. On such processors, executing the if block is typically faster than the else one, assuming the compiler does not reorder the blocks (which is quite frequent when optimizations are enabled). Built-in hints can be provided by developers to help the compiler generate code that executes more efficiently on processors using static prediction. Profile-guided optimization can be used to automatically find conditions that are often true/false and reorder branches accordingly for the sake of performance.
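For example, with GNU C's __builtin_expect (the function name and scenario below are made up purely for illustration; on statically predicting CPUs the hint mainly affects which block ends up on the fall-through path):

#include <cstdio>

// Hypothetical length check, for illustration only.
void check_length(int len)
{
    // Hint that len < 0 is expected to be false, so the compiler can
    // move the error block off the fall-through path.
    if (__builtin_expect(len < 0, 0)) {
        printf("bad length!\n");
        return;
    }
    printf("length ok: %d\n", len);
}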
Mainstream (server, desktop and high-end mobile) processors mainly use dynamic prediction algorithms. For example, they can track and memorize whether a branch was taken so as to estimate the probability of it being taken in the future. A processor typically tracks only a limited set of conditional branches; thus, when code has too many (nested) conditional branch instructions, a static prediction algorithm can be used as a fallback.
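To make the "track and memorize" idea concrete, here is a toy model of a classic 2-bit saturating-counter predictor, one of the simplest dynamic schemes (a sketch only, not how any real CPU works; real predictors such as TAGE are far more elaborate):

#include <cstdio>

struct TwoBitPredictor {
    unsigned state = 2;               // 0-1: predict not-taken, 2-3: predict taken
    bool predict() const { return state >= 2; }
    void update(bool taken) {         // saturating counter: clamp to [0, 3]
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
};

int main()
{
    TwoBitPredictor p;
    const bool outcomes[] = {true, true, false, true, true, true};
    int correct = 0;
    for (bool taken : outcomes) {
        correct += (p.predict() == taken);  // compare prediction to reality
        p.update(taken);                    // then learn from the outcome
    }
    printf("%d/6 predicted correctly\n", correct);  // a single blip costs only one miss
}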
It is worth mentioning that the prediction information can be reset/flushed in some cases, like when a process is preempted (as pointed out by @HenriqueBucher). This means prediction can be far less effective when there are many context switches. Note that speculation can be partially controlled by specific instructions so as to mitigate vulnerabilities like Spectre.
A jump to a far, unpredicted location can be expensive, since the processor needs to fetch instructions that may not be in the instruction cache. In your case it certainly does not matter on mainstream x86-64 processors, since a recent x86-64 processor is expected to load all the instructions of this program into the cache pretty quickly. For example, a Skylake processor can fetch 16 bytes/cycle from the instruction cache, while a Zen 2 processor reaches 32 bytes/cycle. Both load 64-byte cache lines into the instruction cache.
There is not a lot of public information about the branch prediction algorithms of recent Intel processors. The one in AMD Zen 2+ processors is pretty well documented: it uses an efficient TAGE predictor combined with a perceptron to predict the outcome of a condition based on past statistics. You can find detailed information about its behaviour here.
Upvotes: 5
Reputation: 364220
"Right branch of if" which I assumes is the branch "if" is going to take if the condition is met.
No, it's talking about correct branch prediction, vs. the "wrong branch of if (branch misprediction)" entry lower down in the table. This is a thing whether there's an else or not in the C source.
A conditional branch needs to be predicted by the front-end to not stall, but guessing wrong means it has to rewind and discard mis-speculated instructions. This applies whether there's an else part or not. See Mysticial's famous answer on Why is processing a sorted array faster than processing an unsorted array?
An else just requires an extra unconditional branch in one block of instructions to not run the other. (In your case, that's the jmp .L3.) The other option is to put one block somewhere else (e.g. after the ret at the bottom of the function) so there are no taken branches on the fast path, but the other path has a taken jcc and a taken jmp back to rejoin the first path after the if/else.
As you can see, the right branch, which calls printf for "a > 5", lives closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?
Yes, branch layout for I-cache locality is a thing, mostly or totally separate from branch prediction.
As you say, I-cache locality is a thing, so putting the rarely-executed block down at the end of the function (after the ret) is something a compiler can do if it knows (or you tell it with [[likely]] or [[unlikely]]) that one side of an if/else is essentially "cold".
Compilers have various heuristics for guessing whether an if condition is normally true or not, so they can still make a guess even without data from training runs (for profile-guided optimization). They can lay out the machine code with either the if or the else part first.
Also, the not-taken fall-through path of a branch is often slightly cheaper than the taken side, in terms of front-end throughput. This is assuming correct prediction. Modern CPUs fetch machine code in large contiguous blocks, like 16 or 32 bytes at a time, but on a taken branch they might not get to use all of those bytes, and have to do another fetch before they can see more instructions.
A few ISAs have had machine-code branch hints that compilers can use to tell the hardware branch predictor which way a branch is likely to go, but other than that, there's no mechanism for profile-guided optimization, [[likely]] / [[unlikely]], or GNU C __builtin_expect to influence branch prediction, apart from static prediction via layout on CPUs that fall back to static prediction; that doesn't include Intel since Sandybridge: Why did Intel change the static branch prediction mechanism over these years?
BTW, GCC didn't do a great job here. Both sides of the if/else call printf and store to a, just with different values. And it's already using a push/pop for stack alignment, so it could just save/restore RBX to have somewhere to store the compare result across the printf. It could even have made branchless code, using cmov to select between two format-string pointers.
hand_written_main:
        push    rbx                       # realign the stack and save RBX
        mov     ebx, 2
        mov     edi, OFFSET FLAT:.LC0     # we can set regs unconditionally, no else needed
        cmp     dword ptr [RIP+a], 5      # splitting to mov+cmp might be more efficient for macro-fusion of cmp/jcc and maybe micro-fusion
        jg      .L2                       # skip the overrides when a > 5, keeping the defaults
        mov     edi, OFFSET FLAT:.LC1     # or RIP-rel LEA in a PIE
        mov     ebx, 3
.L2:
        xor     eax, eax                  # number of XMM args in regs for variadic functions
        call    printf
        mov     DWORD PTR a[rip], ebx
        pop     rbx
        ret
Interesting to note that this has more instructions executed for one side of the branch, since mov ebx, imm32 and mov edi, imm32 execute unconditionally, and then we run them again if the branch isn't taken. This is like int ebx = 2; / if(a<=5) ebx=3; instead of the =2 in an else. It's a tradeoff of keeping the branching simpler (no jmp anywhere) vs. running an extra instruction. Modern x86 CPUs are pretty wide, and the extra instructions are independent of everything else, so there's instruction-level parallelism here.
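For reference, the cmov-style selection mentioned earlier might look like this in C++ (a sketch; whether the compiler actually emits cmov depends on the compiler version and options):

#include <cstdio>
volatile int a = 2;

int main()
{
    int v = a;                                        // read the volatile once
    const char *msg = v > 5 ? "a > 5!" : "a <= 5!";   // ternary over two pointers often lowers to cmov
    printf("%s", msg);                                // (the asm above passes the pointer as the format string itself)
    a = v > 5 ? 2 : 3;                                // also selectable without a branch
}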
And here's the fun branchless way: packing the strings together 8 bytes apart so we can generate the address of one or the other with a single addressing mode.
hand_written_main:
        push    rbx                       # realign the stack and save RBX
        xor     ebx, ebx
        cmp     dword ptr [RIP+a], 5
        setle   bl                        # bl = (a<=5)
        lea     edi, [gt5msg + rbx*8]     # In a PIE, we'd need [rdi + rbx*8] with a separate RIP-relative LEA
        add     bl, 2                     # 2 + (a<=5) = 2 or 3
        xor     eax, eax                  # number of XMM args in regs for variadic functions
        call    printf
        mov     DWORD PTR a[rip], ebx
        pop     rbx
        ret

.section .rodata
.p2align 3
gt5msg: .asciz "a > 5!"                   # <= 8 bytes long, in fact 7, so there's an extra 1 byte of padding
.p2align 3
le5msg: .asciz "a <= 5!"                  # starts exactly 8 bytes after the previous message
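The same trick can be written in C++ by indexing into adjacent fixed-size strings (a sketch; the compiler may or may not reproduce the exact scaled-index addressing above):

#include <cstdio>
volatile int a = 2;

// Two 8-byte slots, so msgs[le5] is msgs + le5*8: one scaled addressing mode away.
static const char msgs[2][8] = { "a > 5!", "a <= 5!" };

int main()
{
    int le5 = (a <= 5);          // 0 or 1, typically computed with setcc, no branch
    printf("%s", msgs[le5]);
    a = 2 + le5;                 // 2 when a > 5, 3 when a <= 5
}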
Upvotes: 2
Reputation: 30569
No, there's no real difference between the if and else branch, and that's not what the graphic you linked means by the "right" branch. By "right", it means that the CPU's branch prediction correctly guessed which branch would be taken before the comparison was done.
Modern CPUs are pipelined. They start working on the next instruction before they're done with the current one. This can significantly improve performance, since it means that the CPU can keep all of its different sub-components (the instruction decoder, different parts of the ALU, the memory controller, etc.) busy all the time instead of leaving them idle while another component of the CPU does work. That is, it can be performing an addition, fetching an operand from memory, and decoding an instruction all at the same time.
This pipelining depends on knowing what instruction will be executed next though, and conditional branch instructions throw a wrench in it. In your example, the CPU doesn't know if the next instruction that will need to be executed after jle .L2 is mov edi, OFFSET FLAT:.LC0 or mov edi, OFFSET FLAT:.LC1 until after the cmp instruction is done, so it guesses. It starts working on one of the branches, and when the cmp finally finishes it looks to see if it guessed right. If it did, great, the partial work it's done on the following instructions is valid and it keeps going like normal. If it guessed wrong though, all of the work it did on instructions after the jle has to be thrown away, and it starts working on the other branch, which costs some time. An occasional incorrect guess won't make a noticeable performance difference, but if it guesses wrong very often it can start to add up and make a big difference.
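A quick way to observe this is to time the same loop over data where the branch pattern is predictable versus random (a rough sketch; numbers vary by machine, and you may need to limit optimization, e.g. -O1, so the compiler doesn't turn the branch into branchless or vectorized code):

#include <cstdio>
#include <cstdlib>
#include <chrono>

// Sum the elements >= 128; the conditional branch is the part under test.
static long sum_if(const unsigned char *v, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] >= 128)
            s += v[i];
    return s;
}

int main()
{
    const int n = 1 << 22;
    unsigned char *predictable = (unsigned char *)malloc(n);
    unsigned char *unpredictable = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) {
        predictable[i] = (i < n / 2) ? 0 : 255;   // long runs: easy to predict
        unpredictable[i] = rand() & 0xFF;         // essentially random outcomes
    }
    for (const unsigned char *data : {predictable, unpredictable}) {
        auto t0 = std::chrono::steady_clock::now();
        long s = sum_if(data, n);
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("sum=%ld time=%lld us\n", s, us);
    }
    free(predictable);
    free(unpredictable);
}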
See also the highest rated answer on StackOverflow.
Note that on some older or embedded CPUs the branch prediction algorithm is essentially just "always guess the first one", so on such CPUs the else path would be slower, since branch prediction would never guess that by default. On those CPUs, GCC's __builtin_expect or C++20's [[likely]] attributes can be useful to tell the compiler which branch is more likely, so that it will generate assembly code to make the more likely path "the first one", even if it's the else.
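For instance, the C++20 attribute form (the function and condition here are invented for illustration):

#include <cstdio>

// Hypothetical hot-path check, purely illustrative.
void handle(int err)
{
    if (err == 0) [[likely]] {        // hint: success is the common case
        printf("ok\n");
    } else [[unlikely]] {             // cold error path can be laid out last
        printf("error %d\n", err);
    }
}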
Upvotes: 3