geza
geza

Reputation: 30010

Understanding branch prediction efficiency

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:

N   time
=========
8   2.2
9   2.2
10  2.2
11  2.2
12  2.3
13  4.6
14  9.5
15  11.6
16  12.7
20  12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];

    uint64_t sum = 0;

#if 1
    const int N = 16;

    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }

#endif
    std::cout<<sum<<"\n";
}

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

  401270:       80 3c 0c 00             cmp    BYTE PTR [rsp+rcx*1],0x0
  401274:       74 07                   je     40127d <main+0xad>
  401276:       48 03 35 e3 2d 00 00    add    rsi,QWORD PTR [rip+0x2de3]        # 404060 <one>
  40127d:       80 7c 0c 01 00          cmp    BYTE PTR [rsp+rcx*1+0x1],0x0
  401282:       74 07                   je     40128b <main+0xbb>
  401284:       48 03 35 d5 2d 00 00    add    rsi,QWORD PTR [rip+0x2dd5]        # 404060 <one>
  40128b:       48 83 c1 02             add    rcx,0x2
  40128f:       48 81 f9 00 00 01 00    cmp    rcx,0x10000
  401296:       75 d8                   jne    401270 <main+0xa0>

Upvotes: 4

Views: 1916

Answers (1)

geza
geza

Reputation: 30010

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N   time          cycles  branch-misses (%)      approx-time
===============================================================
8    2.2   9,084,889,375         34,806 ( 0.00)    2.2
9    2.2   9,212,112,830         39,725 ( 0.00)    2.2
10   2.2   9,264,903,090      2,394,253 ( 0.06)    2.2
11   2.2   9,415,103,000      8,102,360 ( 0.19)    2.2
12   2.3   9,876,827,586     27,169,271 ( 0.63)    2.3
13   4.6  19,572,398,825    486,814,972 (11.33)    4.6
14   9.5  39,813,380,461  1,473,662,853 (34.31)    9.5
15  11.6  49,079,798,916  1,915,930,302 (44.61)   11.7
16  12.7  53,216,900,532  2,113,177,105 (49.20)   12.7
20  12.9  54,317,444,104  2,149,928,923 (50.06)   12.9

Note: branch-misses (%) is calculated for 2^32 branches

As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.

The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

Upvotes: 2

Related Questions