Reputation: 5073
There are 2 possible techniques shown below which do the same task.
I would like to know if there will be any performance difference between the two.
I think the first technique will suffer due to branch prediction as contents of A
are random.
Technique 1:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
{
private:
bool flag;
public:
void setFlag(bool f) {flag = f;}
};
int main()
{
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
time(&mytime1);
for(int test = 0; test < 5000; test++)
{
for(int i = 0; i < SIZE; i++)
{
if(A[i] > 100)
obj.setFlag(true);
else
obj.setFlag(false);
}
}
time(&mytime2);
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:07 2014
Sat May 03 20:08:32 2014
i.e. Time taken = 25sec
Technique 2:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
{
private:
bool flag;
public:
void setFlag(bool f) {flag = f;}
};
int main()
{
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
time(&mytime1);
for(int test = 0; test < 5000; test++)
{
for(int i = 0; i < SIZE; i++)
{
obj.setFlag(A[i] > 100);
}
}
time(&mytime2);
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:42 2014
Sat May 03 20:09:10 2014
i.e. Time taken = 28sec
The compilation is done using MinGW 64 bt compiler with no flags. From the results it looks like the opposite is happening.
EDIT:
After making the check for RAND_MAX / 2 instead of 100, I am getting the following results:
Technique 1: 70sec
Technique 2: 28sec
So it becomes clear now that Technique 2 is better than technique 1 and can be explained on the basis of branch prediction failure phenomenon.
Upvotes: 0
Views: 197
Reputation: 19706
I agree with the fact that this isn't very interesting for practical code (especially when it dissapears with -O3), but for the sake of academic interest: In some conditions it may be better to rely on the branch predictor.
On one hand, in this particular case the branch is almost always going to be not-taken (as RAND_MAX >> 100), which is easy to predict both interms of branch resolution as well as the next IP address. Try converting the prediciton to a 50% chance and then benchmark this.
On the other hand, the second operation turns the stores done to the obj
flag into being data-dependent with the loads from A[i]
. These loads are going to be slow as your dataset is 1000000*sizeof(A) bytes at least (almost 4MB), meaning that it could be either in the L3 cache or the memory - either way that's quiet a few cycles per each new line (once every few accesses) - when the writes to the flag were independent, they could queue in parallel, now you have to stall them until you get the data. In Theory, the CPU should be able to "pipeline" this, since stores are performed much later than loads along the pipeline on most CPUs, but in practice you're limited by the size of the execution window, in most machines that would be ~100 I believe), so if the store of the current iteration is stalled, you won't be able to launch too far ahead the loads required for the future iterations.
In other words - you may be losing due to the fact that CPUs have a fairly decent branch prediction, but no (or hardly no) data prediction.
Upvotes: 1
Reputation: 385144
With optimisations enabled the binaries are exactly the same, in GCC 4.8 at least: demo.
They're different with optimisations disabled, though: demo.
This very poor attempt at a measurement suggests that the second is actually slower, though both programs run in the same duration in real terms: demo
real 0m0.052s
user 0m0.036s
sys 0m0.012s
real 0m0.052s
user 0m0.044s
sys 0m0.004s
To find out how they really differ in performance with optimisations disabled, you can benchmark properly with more runs.
Frankly, though, since it's irrelevant for your production code I wouldn't even bother.
Upvotes: 3