Reputation: 226
I stumbled across this while benchmarking a circular buffer. Can anyone explain how a std::vector manages to outperform a plain array in this instance?
#include <iostream>
#include <vector>

struct uint_pair {
    unsigned int a, b;
    uint_pair(unsigned int x = 0, unsigned int y = 0) : a(x), b(y) {}
};

struct container {
    unsigned int pos;
#ifdef USE_VECTOR
    std::vector<uint_pair> data;
    container() : pos(0) { data.resize(16); }
#else
    uint_pair data[16];
    container() : pos(0) {}
#endif
    void add(uint_pair val) {
        data[++pos % 16] = val;
    }
};

int main() {
    container c;
    for (unsigned int i = 0; i < 1000000000; i++) c.add(uint_pair{i, i});
    std::cout << c.data[0].a << " " << c.data[0].b << std::endl;
}
These are the results I'm getting using GCC (similar with Clang):
g++ -o bench -std=c++0x -Os main.cpp -D'USE_VECTOR'
real 0m8.757s
user 0m8.750s
sys 0m0.002s
g++ -o bench -std=c++0x -Os main.cpp
real 0m9.215s
user 0m9.209s
sys 0m0.002s
Upvotes: 11
Views: 2676
Reputation: 61
It seems the C++11 compiler generates better code for the vector because of operator=(rvalue reference). First, with a C++03 compiler the plain array is twice as fast as the vector. Second, there is no difference between the two if you use the void set(unsigned int x, unsigned int y) suggested by Adam.
Assembler code for the vector:
.L49:
leal (%rdi,%rax), %esi
andl $15, %esi
leaq (%rdx,%rsi,8), %rsi
movl %eax, (%rsi)
movl %eax, 4(%rsi)
incq %rax
cmpq $1000000000, %rax
jne .L49
and for the plain array (note that pos is reloaded from and stored back to the stack at 12(%rsp) on every iteration, instead of being kept in a register):
.L3:
movl 12(%rsp), %edx
incl %edx
movl %edx, 12(%rsp)
andl $15, %edx
leaq 12(%rsp,%rdx,8), %rdx
movl %eax, 4(%rdx)
movl %eax, 8(%rdx)
incl %eax
cmpl $1000000000, %eax
jne .L3
Upvotes: 0
Reputation: 17329
Here's how you can eliminate the difference: instead of your add, use a function like this:
void set(unsigned int x, unsigned int y) {
    ++pos;
    data[pos % 16].a = x;
    data[pos % 16].b = y;
}
called like this:
for (unsigned int i = 0; i < 1000000000; i++) c.set(i, i);
This does exactly the same thing as yours, but it avoids semantically creating a temporary object. It looks like when you use a vector the compiler is better able to optimize out the temporary.
$ g++-4.8 -o bench -std=c++11 -Os main.cpp -DUSE_VECTOR
$ time ./bench
999999999 999999999
real 0m0.635s
user 0m0.630s
sys 0m0.002s
$ g++-4.8 -o bench -std=c++11 -Os main.cpp
$ time ./bench
999999999 999999999
real 0m0.644s
user 0m0.639s
sys 0m0.002s
On my machine both the set and add methods yield identical performance with vectors. Only the array shows a difference. To further lend credence to the optimization explanation: if you compile with -O0, the array method is slightly faster (but both are over 10x slower than with -Os).
This doesn't explain why the compiler treats the two differently. A vector is, after all, backed by an array. Also, a std::array behaves identically to your C-style array.
Upvotes: 9
Reputation: 395
One problem is with the placement of the "pos" member in your struct.
For the c-array, remember that its storage sits contiguously in the struct, adjacent to your "pos" member. When data is pushed into the c-array, extra instructions have to be issued to offset into the structure past "pos". Writing to the vector has no such constraint, since its storage is allocated elsewhere.
To squeeze out more performance, make sure your hottest data is at the front of a cache line.
Edit:
To get the c-array to perform just as fast as the vector, it must be allocated on an 8-byte boundary on a 64-bit machine. So something like:
uint_pair* data;
unsigned int pos;

container() : pos(0) {
    // Over-allocate by one element so there is room to shift the pointer.
    std::size_t bufSize = sizeof(uint_pair) * 17;
    void* p = new char[bufSize];
    // std::align (from <memory>) bumps p to the next 8-byte boundary,
    // shrinks bufSize accordingly, and returns the adjusted pointer.
    p = std::align(8, sizeof(uint_pair), p, bufSize);
    data = reinterpret_cast<uint_pair*>(p);
}
With a slightly modified add function:
void add(unsigned int x, unsigned int y) {
    auto& ref = data[pos++ % 16];
    ref.a = x;
    ref.b = y;
}
The c-array now times:
real 0m0.735s
user 0m0.730s
sys 0m0.002s
And the std::vector:
real 0m0.743s
user 0m0.736s
sys 0m0.004s
The standard library implementers are pulling out all the stops for you :)
Upvotes: 2