Reputation: 382
I am benchmarking some example functions on my processor, where each core runs at 2 GHz. Here are the functions being benchmarked; they are also available on quick-bench.
#include <benchmark/benchmark.h>
#include <stdlib.h>
#include <time.h>
#include <cstdint>
#include <memory>
#include <string>

class Base
{
public:
    virtual int addNumVirt( int x ) { return (i + x); }
    int addNum( int x ) { return (x + i); }
    virtual ~Base() {}
private:
    uint32_t i{10};
};

class Derived : public Base
{
public:
    // Overrides of virtual functions are always virtual
    int addNumVirt( int x ) override { return (x + i); }
    int addNum( int x ) { return (x + i); }
private:
    uint32_t i{20};
};
static void BM_nonVirtualFunc(benchmark::State &state)
{
    srand(time(0));
    volatile int x = rand();
    std::unique_ptr<Derived> derived = std::make_unique<Derived>();
    for (auto _ : state)
    {
        auto result = derived->addNum( x );
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_nonVirtualFunc);
static void BM_virtualFunc(benchmark::State &state)
{
    srand(time(0));
    volatile int x = rand();
    std::unique_ptr<Base> derived = std::make_unique<Derived>();
    for (auto _ : state)
    {
        auto result = derived->addNumVirt( x );
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_virtualFunc);
static void StringCreation(benchmark::State& state) {
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        std::string created_string("hello");
        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(created_string);
    }
}
// Register the function as a benchmark
BENCHMARK(StringCreation);
static void StringCopy(benchmark::State& state) {
    // Code before the loop is not measured
    std::string x = "hello";
    for (auto _ : state) {
        std::string copy(x);
    }
}
BENCHMARK(StringCopy);
Below are the Google Benchmark results.
Run on (64 X 2000 MHz CPU s)
CPU Caches:
L1 Data 32K (x32)
L1 Instruction 64K (x32)
L2 Unified 512K (x32)
L3 Unified 8192K (x8)
Load Average: 0.08, 0.04, 0.00
------------------------------------------------------------
Benchmark             Time             CPU     Iterations
------------------------------------------------------------
BM_nonVirtualFunc     0.490 ns         0.490 ns  1000000000
BM_virtualFunc        0.858 ns         0.858 ns   825026009
StringCreation        2.74 ns          2.74 ns    253578500
BM_StringCopy         5.24 ns          5.24 ns    132874574
The results show execution times of 0.490 ns and 0.858 ns for the first two functions. However, what I do not understand is this: if my core runs at 2 GHz, one cycle takes 0.5 ns, so 0.490 ns amounts to slightly less than one cycle per call, which makes the result seem unreasonable. I know that the reported time is an average over the number of iterations, and such a low average means that most of the samples took less than 0.5 ns. What am I missing?
Edit 1:
From the comments, it seems that adding a constant i to x was not a good idea. In fact, I started out by calling std::cout in the virtual and non-virtual functions, which helped me understand that virtual functions are not inlined and that the call has to be resolved at run time. However, producing output in the functions being benchmarked does not look nice on the terminal. (Is there a way to share my code from Godbolt?) Can anyone propose an alternative to printing something inside the function?
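For reference, here is a sketch of one alternative that avoids I/O entirely: marking the functions noinline so the compiler cannot inline them away. This is only an illustration, not code from the benchmark above; __attribute__((noinline)) is a GCC/Clang extension (MSVC spells it __declspec(noinline)).

#include <cstdint>

// Sketch: keep the calls out of line without printing anything.
// Assumption: a GCC/Clang toolchain, so __attribute__((noinline)) applies.
class Base
{
public:
    __attribute__((noinline)) virtual int addNumVirt( int x ) { return (i + x); }
    __attribute__((noinline)) int addNum( int x ) { return (x + i); }
    virtual ~Base() {}
private:
    uint32_t i{10};
};

With the functions kept out of line, the benchmark still measures a real call and return, and nothing is written to the terminal.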
Upvotes: 3
Views: 2808
Reputation: 1037
Modern compilers just do magnificent things; not always the most predictable things, but usually good things. You can see that either by inspecting the generated assembly, as suggested, or by reducing the optimization level: -O1 makes BM_nonVirtualFunc equivalent to BM_virtualFunc in terms of CPU time, and -O0 raises all your functions to a similar level (Edit: of course in a bad way; do not use that to draw performance conclusions).
And yeah, when I first used Quick Bench I was confused by "DoNotOptimize" as well. A better name might have been "UseResult()", to signal what it is actually meant to do when benchmarking.
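To illustrate the point, here is a minimal sketch (the benchmark names BM_AddDiscarded and BM_AddKept are made up for this example; benchmark::DoNotOptimize itself is the real Google Benchmark call):

#include <benchmark/benchmark.h>

// Without DoNotOptimize the result is never observed, so the
// compiler may delete the whole body and time an empty loop.
static void BM_AddDiscarded(benchmark::State& state) {
    for (auto _ : state) {
        int sum = 1 + 2;
        (void)sum; // result unused
    }
}
BENCHMARK(BM_AddDiscarded);

// DoNotOptimize makes the value observable, so it has to be
// materialized on every iteration and the body cannot vanish.
static void BM_AddKept(benchmark::State& state) {
    for (auto _ : state) {
        int sum = 1 + 2;
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_AddKept);

In the question's code the same idiom keeps result alive, which is why the virtual and non-virtual calls get measured at all.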
Upvotes: 1