Reputation: 95
Based on C-style Arrays vs std::vector, using std::vector::at, std::vector::operator[], and iterators, I ran the following benchmarks.
no optimization https://quick-bench.com/q/LjybujMGImpATTjbWePzcb6xyck
From here, vectors definitely perform better with -O3.
However, the C-style array is slower with -O3 than with -O0:
C-style (no opt) : about 2500
C-style (O3) : about 3000
I don't know what factors lead to this result. Maybe it's because the compiler is set to C++14?
(I'm not asking about std::vector relative to plain arrays, I'm just asking about plain arrays with/without optimization.)
Upvotes: 0
Views: 222
Reputation: 364200
Your -O0 code wasn't faster in an absolute sense, just as a ratio against an empty for (auto _ : state) {} loop.
That empty loop also gets slower when optimization is disabled, because the state iterator functions don't inline. Check the asm for your own functions: with -O3 the outer-loop counter is kept in %rbx, like:
# outer loop of your -O3 version
sub $0x1,%rbx
jne 407f57 <BM_map_c_array(benchmark::State&)+0x37>
RBX was originally loaded from 0x10(%rdi), from the benchmark::State& state function arg.
You instead get state counter updates in memory, like the following, plus a bunch of convoluted code that materializes a boolean in a register and then tests it again.
# part of the outer loop of your -O0 version
12.50% mov -0x8060(%rbp),%rax
25.00% sub $0x1,%rax
12.50% mov %rax,-0x8060(%rbp)
There are high counts on those instructions because the call to map_c_array didn't inline, so most of the CPU time wasn't actually spent in this function itself. But of the time that was, about half went to these instructions. In an empty loop, or one that calls an empty function (I'm not sure which Quick Bench does), that would still be the case.
Quick Bench does this to try to normalize things for whatever hardware its cloud VM ends up running on, with whatever competing load. Click "About Quick Bench" in the dropdown at the top right.
And see the label on the graph: CPU time / Noop time. (By "Noop", they don't mean a nop machine instruction; they mean a no-op in the C++ sense.)
An empty loop with a loop counter runs about 6x slower when compiled with optimization disabled (bottlenecked on store-to-load forwarding latency of the loop counter), so your -O0 code is "only" a bit less than 6x slower, not exactly 6x slower.
With a counter in a register, modern x86 CPUs can run loops at 1 cycle per iteration, like looptop: dec %ebx / jnz looptop. dec has one-cycle latency, vs. a sub or dec on a memory location taking about 6 cycles, since it includes the store/reload (https://agner.org/optimize/ and https://uops.info/).
With that bottleneck built into the baseline you're comparing against, it's normal that adding some array-access work inside the loop doesn't make it proportionally as much slower as the same array access measured against a fast, optimized empty loop.
Upvotes: 3
Reputation: 68
Because you aren't benchmarking what you think you're benchmarking. I bothered to look at your code, and found that you're measuring how fast your CPU can advance the counter in a for loop combined with how fast your data bus can transfer data. Is this really something you need to worry about, like ever?
In general, benchmarks outside of multi-thousand-line programs are worthless and will never be taken seriously by anyone even remotely experienced in programming, so stop doing that.
Upvotes: -2