Reputation: 3982
I always believed that the function foo2 below is faster than foo3, until I ran a test. Here is the complete code:
#include <iostream>
#include <boost/timer.hpp>
#include <boost/lexical_cast.hpp>
#include <stdint.h>
struct session {
    bool operator==(const session& r) const;
    uint8_t proto;
    uint16_t sport;
    uint16_t dport;
    uint32_t sip;
    uint32_t dip;
};

bool session::operator==(const session& r) const {
    return proto == r.proto && sport == r.sport && dport == r.dport
        && sip == r.sip && dip == r.dip;
}
// My total L1+L2+L3 cache size is about 16 MB, so use a 32 MB array to overflow all the caches.
static const int SIZE = 32 * 1024 * 1024 / sizeof(session);
int sum;
void foo1(session* p) {
    session s = {1, 2, 3, 4, 5};
    for (int i = 0; i < SIZE; i++)
        if (p[i] == s)
            sum++;
}
void foo2(session* p) {
    session s = {1, 2, 3, 4, 5};
    int n = SIZE - SIZE % 4;
    int i;
    for (i = 0; i < n; i += 4) {
        if (p[i + 0] == s)
            sum++;
        if (p[i + 1] == s)
            sum++;
        if (p[i + 2] == s)
            sum++;
        if (p[i + 3] == s)
            sum++;
    }
    /*
    for (; i < SIZE; i++)
        if (p[i] == s)
            sum++;
    */
}
void foo3(session* p) {
    session s = {1, 2, 3, 4, 5};
    int n = SIZE - SIZE % 4;
    int i;
    for (i = 0; i < n; i += 4) {
        if (p[i + 0] == s)
            sum++;
        else if (p[i + 1] == s)
            sum++;
        else if (p[i + 2] == s)
            sum++;
        else if (p[i + 3] == s)
            sum++;
    }
    /*
    for (; i < SIZE; i++)
        if (p[i] == s)
            sum++;
    */
}
int main(int argc, char* argv[]) {
    if (argc < 2)
        return -1;
    int n = boost::lexical_cast<int>(argv[1]);
    session* p = new session[SIZE];
    boost::timer t;
    for (int i = 0; i < n; i++)
        foo1(p);
    std::cout << t.elapsed() << std::endl;
    t.restart();
    for (int i = 0; i < n; i++)
        foo2(p);
    std::cout << t.elapsed() << std::endl;
    t.restart();
    for (int i = 0; i < n; i++)
        foo3(p);
    std::cout << t.elapsed() << std::endl;
    delete [] p;
    return 0;
}
I ran the test 1000 times (./a.out 1000) and got this output:
4.36
3.98
3.96
My machine:
CPU: Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
Caches:
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
In the test, foo2 and foo3 have equivalent performance. Is it because foo2 lets the CPU execute all the unrolled expressions in parallel, so that foo3 ends up the same? Is that right? If so, the else if syntax would seem to violate the basic C/C++ else if semantics. Can someone explain this? Thanks a lot.
Update
My compiler is gcc 4.4.6 on Red Hat, compiled with g++ -Wall -O2 a.cpp.
Upvotes: 2
Views: 285
Reputation: 33457
In certain situations, I would expect foo3 to be faster because it can short circuit (some number of branches less than or equal to 4 will occur, whereas in foo2 four branches always occur). In the situation where s is not equal to any of the 4 array elements (as is extremely likely in this case), foo2 and foo3 are basically the same code, and 4 branches happen in both functions.
Consider what foo3 really looks like (in terms of branches):
if (p[i + 0] == s)
    sum++;
else
    if (p[i + 1] == s)
        sum++;
    else
        if (p[i + 2] == s)
            sum++;
        else
            if (p[i + 3] == s)
                sum++;
This should make it apparent that as long as the ifs keep coming up false, the nested branches are going to happen. This means that in the situation where none of the ifs are true, it will execute the same number of operations as foo2 (though the two are not functionally identical).
A crude way to think about it is that each if has a cost (not the body of the if, the if itself). In other words, each time an if is reached in the execution flow, a certain cost is paid, because a branch must be done. Thinking about it this way, it's clear that the cost of each function is the same when foo3's flow doesn't short circuit (when all 4 of foo3's ifs are encountered). (As KillianDS noted, if branch prediction is wrong, foo3 will actually take longer, since the wrong branch will have to be rewound and the right one executed instead. For you, though, it seems the correct branch is always being chosen.)
It's kind of like how the following snippets of code can have the same performance:
if (short_runtime()) {}
And:
if (short_runtime() && long_runtime()) {}
If short_runtime returns true, the one with the second function call is obviously going to take longer. If short_runtime() returns false, though, the long_runtime() call will never happen, and thus the run times will be the same (or at least extremely similar).
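For concreteness, here is a minimal, self-contained sketch of that analogy; short_runtime and long_runtime are hypothetical stand-ins (a cheap check and a deliberately expensive loop), not functions from the question:

#include <iostream>

// Hypothetical stand-ins for the analogy: a cheap check and an expensive one.
static bool short_runtime() {
    return false;                      // cheap: returns immediately
}

static bool long_runtime() {
    volatile unsigned long x = 0;      // volatile so the loop isn't optimized away
    for (long i = 0; i < 100000000; i++)
        x += i;
    return x != 0;                     // expensive: a hundred million iterations
}

int main() {
    // short_runtime() is false, so && short circuits and long_runtime()
    // is never called; both ifs end up costing about the same.
    if (short_runtime()) {}
    if (short_runtime() && long_runtime()) {}
    std::cout << "done" << std::endl;
    return 0;
}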
To test out this theory, you can make p[i + 0] == s true for every element. Just value-initialize the array (session* p = new session[SIZE]();) and use an all-zero session s = {}; locally, so that every element compares equal.
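A rough sketch of that experiment might look like the following; the all-zero session s = {} is my assumption for matching the value-initialized (zeroed) elements:

#include <iostream>
#include <stdint.h>

struct session {
    bool operator==(const session& r) const {
        return proto == r.proto && sport == r.sport && dport == r.dport
            && sip == r.sip && dip == r.dip;
    }
    uint8_t proto;
    uint16_t sport;
    uint16_t dport;
    uint32_t sip;
    uint32_t dip;
};

static const int SIZE = 32 * 1024 * 1024 / sizeof(session);

int main() {
    // Value-initialization zeroes every member of every element.
    session* p = new session[SIZE]();
    session s = {};              // all-zero, so p[i + 0] == s is true for every element
    int sum = 0;
    int n = SIZE - SIZE % 4;
    for (int i = 0; i < n; i += 4) {
        if (p[i + 0] == s)       // always taken: the rest of the else-if chain is skipped
            sum++;
        else if (p[i + 1] == s)
            sum++;
        else if (p[i + 2] == s)
            sum++;
        else if (p[i + 3] == s)
            sum++;
    }
    std::cout << sum << std::endl;   // prints n / 4
    delete[] p;
    return 0;
}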
There seems to be a bit of confusion about the purpose/result of loop unrolling. It's done so that fewer jumps have to happen. If n things have to be done, instead of n iterations (jumps) with 1 action per iteration, you can have n/k iterations (jumps) instead. When everything can fit in the cache, this can provide a speed boost (if it can't fit in the cache, it can actually kill performance!). The instructions aren't happening simultaneously (if they were, sum would need a mutex around it, which would be extremely expensive). They're simply happening in sets of 4 instead of sets of 1.
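As a generic illustration of that n/k idea (not code from the question), here is a standalone sketch that unrolls a simple summation loop by a factor of 4 and handles the leftover elements in a remainder loop:

#include <iostream>

// Sum an array with an unroll factor of k = 4: roughly n/4 loop jumps
// instead of n, plus a small remainder loop for the leftover elements.
static long sum_unrolled(const int* a, int n) {
    long total = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {       // about n/4 iterations (jumps)
        total += a[i + 0];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; i++)                 // handle the 0..3 leftover elements
        total += a[i];
    return total;
}

int main() {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::cout << sum_unrolled(a, 10) << std::endl;   // prints 55
    return 0;
}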
Upvotes: 3
Reputation: 17196
This is branch prediction at work. With your program I get these speeds (here foo3 is a bit slower; g++ 4.8):
7.57
0.63
0.99
Now, what happens? You do not initialize your array of sessions; since all members of session are PODs, they won't get default-initialized and will contain essentially garbage. Therefore the branch predictor for the ifs in your code will very quickly converge to always predicting the not-taken branch. In this case foo3 and foo2 are quite similar: foo2 executes all the ifs unconditionally, and foo3 executes them all because that is what the predictor expects. I don't really see why foo3 is still a bit slower; I'd have to look at the disassembly for that.
Now see what happens if I add the following default constructor:
session() : proto(1), sport(2), dport(3), sip(4), dip(5) {}
Of course, I also had to change the session variables in the foos to session s;.
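Put together, the modified version might look roughly like this (a sketch based on the question's code plus the constructor above; only foo3 is shown):

#include <iostream>
#include <stdint.h>

struct session {
    // Default constructor from above: every element now holds {1, 2, 3, 4, 5}.
    session() : proto(1), sport(2), dport(3), sip(4), dip(5) {}
    bool operator==(const session& r) const {
        return proto == r.proto && sport == r.sport && dport == r.dport
            && sip == r.sip && dip == r.dip;
    }
    uint8_t proto;
    uint16_t sport;
    uint16_t dport;
    uint32_t sip;
    uint32_t dip;
};

static const int SIZE = 32 * 1024 * 1024 / sizeof(session);
int sum;

void foo3(session* p) {
    session s;                  // the aggregate initializer {1, 2, 3, 4, 5} no longer compiles,
                                // so rely on the default constructor instead
    int n = SIZE - SIZE % 4;
    for (int i = 0; i < n; i += 4) {
        if (p[i + 0] == s)      // always taken: the rest of the else-if chain is skipped
            sum++;
        else if (p[i + 1] == s)
            sum++;
        else if (p[i + 2] == s)
            sum++;
        else if (p[i + 3] == s)
            sum++;
    }
}

int main() {
    session* p = new session[SIZE];   // the constructor runs for every element
    foo3(p);
    std::cout << sum << std::endl;
    delete[] p;
    return 0;
}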
Now my timings become:
9.7
1.5
0.75
Suddenly foo3 is a lot faster, simply because now the branches will mostly be (correctly) predicted as taken. In the case of foo3 this means that only the first condition is evaluated and the rest of the chain is skipped, so each iteration finishes quickly. foo2 still has to evaluate all the branches, even if the prediction is good, which obviously makes it slower.
Upvotes: 2