Reputation: 4769
I came across this very nice infographic which gives a rough estimate of the CPU cycles used for certain operations. While studying it I noticed an entry "Right branch of if", which I assumed is the branch "if" is going to take if the condition is met (EDIT: as pointed out in the comments, "Right" actually means "a rightly predicted branch"). It made me wonder if there's any (however minor) speed difference in the if branch compared to the else branch.
Compare, for example, the following very condensed code:
#include <cstdio>
volatile int a = 2;
int main()
{
    if (a > 5) {
        printf("a > 5!");
        a = 2;
    } else {
        printf("a <= 5!");
        a = 3;
    }
}
Which generates this x86-64 assembly:
.LC0:
        .string "a > 5!"
.LC1:
        .string "a <= 5!"
main:
        push    rcx
        mov     eax, DWORD PTR a[rip]
        cmp     eax, 5
        jle     .L2
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 2
        jmp     .L3
.L2:
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 3
.L3:
        xor     eax, eax
        pop     rdx
        ret
a:
        .long   2
As you can see, the right branch, which calls printf for "a > 5", lives closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?
Upvotes: 7
Views: 1520
Reputation: 50358
To put it shortly: no on mainstream processors, but yes on some old or embedded processors.
Modern mainstream processors are very good at predicting conditions (assuming the branch has been executed before or the test can be done early). When a branch is correctly predicted, it is almost free (besides the use of branch units on some specific execution ports). The processor can speculatively fetch and execute the instructions of the predicted conditional block. When the branch is mispredicted, the processor must perform a kind of rollback operation that is pretty expensive (typically 5-15 cycles).
Some embedded processors and older ones use static prediction algorithms. For example, they can assume that the branch is never taken. On such processors, executing the if block is typically faster than the else one, assuming the compiler does not reorder the blocks (which is quite frequent when optimizations are enabled). Built-in hints can be provided by developers to help the compiler generate code that executes more efficiently on processors using static prediction. Profile-guided optimization can be used to automatically find conditions that are often true/false and reorder branches accordingly for the sake of performance.
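For example, with GNU C's __builtin_expect (the function name and scenario below are made up purely for illustration; on statically predicting CPUs the hint mainly affects which block ends up on the fall-through path):

#include <cstdio>

// Hypothetical length check, for illustration only.
void check_length(int len)
{
    // Hint that len < 0 is expected to be false, so the compiler can
    // move the error block off the fall-through path.
    if (__builtin_expect(len < 0, 0)) {
        printf("bad length!\n");
        return;
    }
    printf("length ok: %d\n", len);
}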
Mainstream (server, desktop and high-end mobile) processors mainly use dynamic prediction algorithms. For example, they can track and memorize whether a branch was taken so as to estimate the probability of it being taken in the future. A processor typically tracks only a limited set of conditional branches; thus, when code has too many (nested) conditional branch instructions, a static prediction algorithm can be used as a fallback.
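To make the "track and memorize" idea concrete, here is a toy model of a classic 2-bit saturating-counter predictor, one of the simplest dynamic schemes (a sketch only, not how any real CPU works; real predictors such as TAGE are far more elaborate):

#include <cstdio>

struct TwoBitPredictor {
    unsigned state = 2;               // 0-1: predict not-taken, 2-3: predict taken
    bool predict() const { return state >= 2; }
    void update(bool taken) {         // saturating counter: clamp to [0, 3]
        if (taken && state < 3) ++state;
        if (!taken && state > 0) --state;
    }
};

int main()
{
    TwoBitPredictor p;
    const bool outcomes[] = {true, true, false, true, true, true};
    int correct = 0;
    for (bool taken : outcomes) {
        correct += (p.predict() == taken);  // compare prediction to reality
        p.update(taken);                    // then learn from the outcome
    }
    printf("%d/6 predicted correctly\n", correct);  // a single blip costs only one miss
}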
It is worth mentioning that the prediction information can be reset/flushed in some cases, like when a process is preempted (as pointed out by @HenriqueBucher). This means prediction can be far less effective when there are many context switches. Note that speculation can be partially controlled by specific instructions so as to mitigate vulnerabilities like Spectre.
A jump to a far, unpredicted location can be expensive, since the processor needs to fetch instructions that may not be in the instruction cache. In your case it certainly does not matter on mainstream x86-64 processors, since a recent x86-64 processor is expected to load all the instructions of this program into the cache pretty quickly. For example, a Skylake processor can fetch 16 bytes/cycle from the instruction cache, while a Zen 2 processor reaches 32 bytes/cycle. Both load 64-byte cache lines into the instruction cache.
There is not a lot of public information about the branch prediction algorithms of recent Intel processors. The one in AMD Zen 2+ processors is pretty well documented: it uses an efficient TAGE predictor combined with a perceptron to predict the outcome of a condition based on past statistics. You can find detailed information about its behaviour here.
Upvotes: 5
Reputation: 364220
"Right branch of if" which I assumes is the branch "if" is going to take if the condition is met.
No, it's talking about correct branch prediction, vs. the "wrong branch of if (branch misprediction)" entry lower down in the table. This is a thing whether there's an else or not in the C source.
A conditional branch needs to be predicted by the front-end to not stall, but guessing wrong means it has to rewind and discard mis-speculated instructions. This applies whether there's an else part or not. See Mysticial's famous answer on Why is processing a sorted array faster than processing an unsorted array?
An else just requires an extra unconditional branch in one block of instructions to not run the other. (In your case, that's the jmp .L3.) The other option is to put one block somewhere else (e.g. after the ret at the bottom of the function) so there are no taken branches on the fast path, but the other path has a taken jcc and a taken jmp back to rejoin the first path after the if/else.
As you can see, the right branch, which calls printf for "a > 5", lives closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?
Yes, branch layout for I-cache locality is a thing, mostly or totally separate from branch prediction.
As you say, I-cache locality is a thing, so putting the rarely-executed block down at the end of the function (after the ret) is something a compiler can do if it knows (or you tell it with [[likely]] or [[unlikely]]) that one side of an if/else is essentially "cold".
Compilers have various heuristics for guessing whether an if condition is normally true or not, so they can still make a guess even without data from training runs (for profile-guided optimization). They can lay out the machine code with either the if or the else part first.
Also, the not-taken fall-through path of a branch is often slightly cheaper than the taken side, in terms of front-end throughput. This is assuming correct prediction. Modern CPUs fetch machine code in large contiguous blocks, like 16 or 32 bytes at a time, but on a taken branch they might not get to use all of those bytes, and have to do another fetch before they can see more instructions.
A few ISAs have had machine-code branch hints that compilers can use to tell the hardware branch predictor which way a branch is likely to go, but other than that, there's no mechanism for profile-guided optimization, [[likely]] / [[unlikely]], or GNU C __builtin_expect to influence branch prediction, apart from static prediction via layout on CPUs that fall back to static prediction; that doesn't include Intel since Sandybridge: Why did Intel change the static branch prediction mechanism over these years?
BTW, GCC didn't do a great job here. Both sides of the if/else call printf and store to a, just with different values. And it's already using a push/pop for stack alignment, so it could just save/restore RBX to have somewhere to store the compare result across the printf. It could even have made branchless code, using cmov to select between two format-string pointers.
hand_written_main:
        push    rbx                       # realign the stack and save RBX
        mov     ebx, 2
        mov     edi, OFFSET FLAT:.LC0     # we can set regs unconditionally, no else needed
        cmp     dword ptr [RIP+a], 5      # splitting to mov+cmp might be more efficient for macro-fusion of cmp/jcc and maybe micro-fusion
        jg      .L2                       # skip the overrides when a > 5, keeping the defaults
        mov     edi, OFFSET FLAT:.LC1     # or RIP-rel LEA in a PIE
        mov     ebx, 3
.L2:
        xor     eax, eax                  # number of XMM args in regs for variadic functions
        call    printf
        mov     DWORD PTR a[rip], ebx
        pop     rbx
        ret
Interesting to note that this has more instructions executed for one side of the branch, since mov ebx, imm32 and mov edi, imm32 execute unconditionally, and then we run them again if the branch isn't taken. This is like int ebx = 2; / if(a<=5) ebx=3; instead of the =2 in an else. It's a tradeoff of keeping the branching simpler (no jmp anywhere) vs. running an extra instruction. Modern x86 CPUs are pretty wide, and the extra instructions are independent of everything else, so there's instruction-level parallelism here.
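For reference, the cmov-style selection mentioned earlier might look like this in C++ (a sketch; whether the compiler actually emits cmov depends on the compiler version and options):

#include <cstdio>
volatile int a = 2;

int main()
{
    int v = a;                                        // read the volatile once
    const char *msg = v > 5 ? "a > 5!" : "a <= 5!";   // ternary over two pointers often lowers to cmov
    printf("%s", msg);                                // (the asm above passes the pointer as the format string itself)
    a = v > 5 ? 2 : 3;                                // also selectable without a branch
}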
And here's the fun branchless way: packing the strings together 8 bytes apart so we can generate the address of one or the other with a single addressing mode.
hand_written_main:
        push    rbx                       # realign the stack and save RBX
        xor     ebx, ebx
        cmp     dword ptr [RIP+a], 5
        setle   bl                        # bl = (a<=5)
        lea     edi, [gt5msg + rbx*8]     # In a PIE, we'd need [rdi + rbx*8] with a separate RIP-relative LEA
        add     bl, 2                     # 2 + (a<=5) = 2 or 3
        xor     eax, eax                  # number of XMM args in regs for variadic functions
        call    printf
        mov     DWORD PTR a[rip], ebx
        pop     rbx
        ret

.section .rodata
.p2align 3
gt5msg: .asciz "a > 5!"                   # <= 8 bytes long, in fact 7, so there's an extra 1 byte of padding
.p2align 3
le5msg: .asciz "a <= 5!"                  # starts exactly 8 bytes after the previous message
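The same trick can be written in C++ by indexing into adjacent fixed-size strings (a sketch; the compiler may or may not reproduce the exact scaled-index addressing above):

#include <cstdio>
volatile int a = 2;

// Two 8-byte slots, so msgs[le5] is msgs + le5*8: one scaled addressing mode away.
static const char msgs[2][8] = { "a > 5!", "a <= 5!" };

int main()
{
    int le5 = (a <= 5);          // 0 or 1, typically computed with setcc, no branch
    printf("%s", msgs[le5]);
    a = 2 + le5;                 // 2 when a > 5, 3 when a <= 5
}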
Upvotes: 2
Reputation: 30569
No, there's no real difference between the if and else branch, and that's not what the graphic you linked means by the "right" branch. By "right", it means that the CPU's branch prediction correctly guessed which branch would be taken before the comparison was done.
Modern CPUs are pipelined. They start working on the next instruction before they're done with the current one. This can significantly improve performance, since it means that the CPU can keep all of its different sub-components (the instruction decoder, different parts of the ALU, the memory controller, etc.) busy all the time instead of leaving them idle while another component of the CPU does work. That is, it can be performing an addition, fetching an operand from memory, and decoding an instruction all at the same time.
This pipelining depends on knowing what instruction will be executed next though, and conditional branch instructions throw a wrench in it. In your example, the CPU doesn't know if the next instruction that will need to be executed after jle .L2 is mov edi, OFFSET FLAT:.LC0 or mov edi, OFFSET FLAT:.LC1 until after the cmp instruction is done, so it guesses. It starts working on one of the branches, and when the cmp finally finishes it looks to see if it guessed right. If it did, great, the partial work it's done on the following instructions is valid and it keeps going like normal. If it guessed wrong though, all of the work it did on instructions after the jle has to be thrown away, and it starts working on the other branch, which costs some time. An occasional incorrect guess won't make a noticeable performance difference, but if it guesses wrong very often it can start to add up and make a big difference.
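A quick way to observe this is to time the same loop over data where the branch pattern is predictable versus random (a rough sketch; numbers vary by machine, and you may need to limit optimization, e.g. -O1, so the compiler doesn't turn the branch into branchless or vectorized code):

#include <cstdio>
#include <cstdlib>
#include <chrono>

// Sum the elements >= 128; the conditional branch is the part under test.
static long sum_if(const unsigned char *v, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] >= 128)
            s += v[i];
    return s;
}

int main()
{
    const int n = 1 << 22;
    unsigned char *predictable = (unsigned char *)malloc(n);
    unsigned char *unpredictable = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) {
        predictable[i] = (i < n / 2) ? 0 : 255;   // long runs: easy to predict
        unpredictable[i] = rand() & 0xFF;         // essentially random outcomes
    }
    for (const unsigned char *data : {predictable, unpredictable}) {
        auto t0 = std::chrono::steady_clock::now();
        long s = sum_if(data, n);
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("sum=%ld time=%lld us\n", s, us);
    }
    free(predictable);
    free(unpredictable);
}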
See also the highest rated answer on StackOverflow.
Note that on some older or embedded CPUs the branch prediction algorithm is essentially just "always guess the first one", so on such CPUs the else path would be slower, since branch prediction would never guess that by default. On those CPUs, GCC's __builtin_expect or C++20's [[likely]] attributes can be useful to tell the compiler which branch is more likely, so that it will generate assembly code to make the more likely path "the first one", even if it's the else.
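For instance, the C++20 attribute form (the function and condition here are invented for illustration):

#include <cstdio>

// Hypothetical hot-path check, purely illustrative.
void handle(int err)
{
    if (err == 0) [[likely]] {        // hint: success is the common case
        printf("ok\n");
    } else [[unlikely]] {             // cold error path can be laid out last
        printf("error %d\n", err);
    }
}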
Upvotes: 3