Reputation: 2134
I am benchmarking part of my GaussQuadrature template class by calling the member function Quadrature 1M times in a loop and timing the entire loop (if you know a better way, please let me know!). In the intended application of this class, the constructor will be called once and the routine will be called multiple times, but I tried both calling the constructor 1M times inside the loop and once outside just to see which was faster.
If I call the constructor inside the loop, it runs on every iteration, but the object appears to be placed at the same memory address each time (note that for the snippet below, the constructor contains the statement std::cout << "ctor\n";). Is this a coincidence, or something deeper? Here's the snippet:
for (int i = 0; i < 1000000; ++i) {
    GaussQuadrature<double> Q(N, func, a, b); // some arguments
    Q.Quadrature();
    std::cout << "address: " << &Q << "\n";
}
This gives:
ctor
address: 0x7fff23059400
ctor
address: 0x7fff23059400
.
.
.
ctor
address: 0x7fff23059400
Timing this without the print statements in the loop or constructor, i.e.
clock_t tic = clock();
for (int i = 0; i < 1000000; ++i) {
    GaussQuadrature<double> Q(N, func, a, b);
    Q.Quadrature();
}
float elapsed = (clock() - (float)tic) / CLOCKS_PER_SEC;
std::cout << std::setprecision(10) << "time: " << elapsed << "\n";
results in
time: 12.47999954
Now, if I instead call the constructor outside the loop, then of course it is only called once, but it actually takes longer:
clock_t tic = clock();
GaussQuadrature<double> Q(N, func, a, b);
for (int i = 0; i < 1000000; ++i) {
    Q.Quadrature();
}
float elapsed = (clock() - (float)tic) / CLOCKS_PER_SEC;
std::cout << std::setprecision(10) << "time: " << elapsed << "\n";
This gives
time: 12.65999985
I ran this several times with similar results regarding the time. Any thoughts? Thanks!
Upvotes: 2
Views: 487
Reputation: 5962
This could be an effect of the optimizations done by the compiler. For loops can be automatically "unrolled" (http://en.wikipedia.org/wiki/Loop_unwinding), which minimizes branching, and the unrolled code can also take advantage of the CPU's internal parallelism and pipelining.
"If the statements in the loop are independent of each other (i.e. where statements that occur earlier in the loop do not affect statements that follow them), the statements can potentially be executed in parallel."
Try to disable all optimizations and check those times again.
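To make the unrolling idea concrete, here is a minimal hand-written sketch of what a compiler might do automatically. The function names sum_rolled and sum_unrolled4 are hypothetical and are not part of the GaussQuadrature class from the question; the unroll factor of 4 is an arbitrary choice for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain loop: one addition and one loop-condition check per element.
double sum_rolled(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Hand-unrolled by 4 (hypothetical illustration of loop unrolling).
// Four independent accumulators mean the additions inside one pass
// do not depend on each other, so the CPU may overlap them.
double sum_unrolled4(const std::vector<double>& v) {
    const std::size_t n = v.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // one condition check per 4 elements
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {          // leftover elements when n % 4 != 0
        s += v[i];
    }
    return s;
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input the two results can differ in the last bits. That reordering is one reason compilers usually only apply this kind of transformation to floating-point sums when explicitly allowed to.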
Upvotes: 0
Reputation: 591
First of all, when running timing tests, you need to disable any text output such as printf and cout calls. This is important because including them hides the real performance of your class. Your testing method is OK: it is reasonable to call the algorithm 1 million times, but keep cache effects in mind. Getting the same address every time is possible, depending on the behaviour of your memory manager. Finally, I would guess that once you comment out the text output, your second implementation will be faster than the first one.
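On the repeated address: Q in the question is a local object declared inside the loop body, so it is created and destroyed on every iteration, and compilers typically reuse the same stack slot for it. The sketch below counts how many distinct addresses such a local object gets. Probe and count_distinct_addresses are hypothetical names used only for illustration, and the result is implementation-dependent, so treat it as an experiment rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for a class like GaussQuadrature<double>.
struct Probe {
    double payload[4];
    Probe() : payload{} {}
};

// Constructs a fresh local Probe on every iteration and records its
// address. Returns the number of distinct addresses observed.
std::size_t count_distinct_addresses(int iterations) {
    std::set<const void*> seen;
    for (int i = 0; i < iterations; ++i) {
        Probe p;  // constructed here, destroyed at the end of the iteration
        seen.insert(static_cast<const void*>(&p));
    }
    return seen.size();
}
```

Calling count_distinct_addresses(1000000) and printing the result would show how many different addresses your compiler actually used. A result of 1 would match the output in the question.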
Final suggestion: on Windows, you may prefer to use the QueryPerformanceFrequency and QueryPerformanceCounter functions for more precise timings.
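If you want something portable instead, a similar measurement can be sketched with std::chrono::steady_clock from the C++11 standard library. The helper name time_seconds is hypothetical. Keep in mind that clock() in the question reports CPU time, while steady_clock reports elapsed wall-clock time, so the two numbers are not directly comparable.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `repetitions` times and returns the
// elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double time_seconds(F&& work, int repetitions) {
    assert(repetitions >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    // duration<double> defaults to seconds as a floating-point value.
    return std::chrono::duration<double>(stop - start).count();
}
```

For the loop in the question, this could be used as time_seconds([&] { Q.Quadrature(); }, 1000000), assuming Q has been constructed beforehand. Running the measurement several times and comparing the spread helps judge whether a difference of a percent or two is meaningful.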
Upvotes: 2