Reputation: 399
I've tested the following two ways of filling a vector with 100'000 elements:
#include <iostream>
#include <vector>
#include <chrono>

using std::cout;
using std::endl;
using std::vector;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;

int main()
{
    const int n = 100'000;

    cout << "Range constructor: " << endl;
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    int nums10[n];
    for (int i = 0; i < n; ++i) {
        nums10[i] = i;
    }
    vector<int> nums11(nums10, nums10 + n);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    cout << "Duration: " << duration_cast<std::chrono::microseconds>(t1 - t0).count() << endl;

    cout << "Fill constructor: " << endl;
    t0 = high_resolution_clock::now();
    vector<int> nums1(n);
    for (int i = 0; i < n; ++i) {
        nums1[i] = i;
    }
    t1 = high_resolution_clock::now();
    cout << "Duration: " << duration_cast<std::chrono::microseconds>(t1 - t0).count() << endl;
}
In my case, the range constructor runs almost 10 times faster (600 microseconds vs ~5000 microseconds).
Why would there be any performance difference here at all? To my understanding, there is an equal number of assignment operations either way. With the range constructor, 100'000 elements are assigned to the array, and then all of them are copied into the vector.
Should this not be identical to the fill constructor, in which 100'000 elements are first value-initialized to 0, and then all of them are assigned their "real" values in the for loop?
Upvotes: 1
Views: 479
Reputation: 58132
Here's the compiled code on godbolt, with gcc -O0.
In the first test:
- The loop to fill the array (lines 49-57 of the assembly) is compiled as a simple loop, with a store to memory on each iteration. It's not well optimized (the index variable is kept on the stack instead of in a register, and redundantly moved back and forth), but at least it's inline code and doesn't make any function calls.
- The range constructor is a single call (line 69). The bulk of its work ends up in the C library's memcpy, which is aggressively optimized (often handwritten assembly) no matter what flags your own code was compiled with. So it's probably about as fast as it could be (see the sketch below).
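To make that concrete: for a trivially copyable element type like int, range construction can boil down to one allocation plus one bulk copy. Here's a rough sketch of that idea (my own simplification, not the actual library source; copy_range is a made-up name):

#include <cstring>   // std::memcpy, std::size_t
#include <new>       // ::operator new

// Simplified illustration of range construction over trivially copyable
// elements: allocate raw storage, then do a single bulk copy.
// (A real vector would also track capacity and free this storage later.)
int* copy_range(const int* first, std::size_t count)
{
    int* dst = static_cast<int*>(::operator new(count * sizeof(int)));
    std::memcpy(dst, first, count * sizeof(int)); // libc memcpy: vectorized, tuned per CPU
    return dst;
}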
In the second test:
- The fill constructor is a library call (line 113). Again, this is presumably about as fast as it can be (probably calling a hand-optimized memset; see the first sketch after this list).
- Your loop to fill the vector (lines 118-130) generates a function call to std::vector<int>::operator[] on every iteration (line 126). Even though operator[] itself is only a few instructions, the overhead of a function call every time, including the code to reload registers with its arguments, is a killer. If you were compiling with optimizations, this call could be inlined; all that overhead would go away, and you'd again have just a single store to memory per iteration (see the second sketch below). But you're not optimizing, so guess what? Performance is deeply sub-optimal.
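The first sketch: for int, vector<int> nums1(n) value-initializes its elements, which amounts to allocate-then-zero. Roughly (again my own simplification; make_zeroed is a made-up name):

#include <cstring>   // std::memset, std::size_t
#include <new>       // ::operator new

// Simplified illustration of the fill constructor for int: allocate raw
// storage, then zero it in one pass with a heavily tuned memset.
int* make_zeroed(std::size_t count)
{
    int* p = static_cast<int*>(::operator new(count * sizeof(int)));
    std::memset(p, 0, count * sizeof(int));
    return p;
}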
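And the second sketch: you can approximate what inlining would buy you, even at -O0, by hoisting the element access onto a raw pointer with data() (available since C++11). This is a drop-in replacement for the loop in the second test:

// Same loop with the per-iteration operator[] call stripped away by hand;
// at -O2 the compiler effectively produces this on its own.
int* p = nums1.data();
for (int i = 0; i < n; ++i) {
    p[i] = i;  // a plain store per iteration, no call overhead
}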
With optimizations enabled, the second test is apparently faster. This makes sense, as it only has to write the same block of memory twice, never reading it; whereas the first one writes one block of memory, then reads it back in order to write the second block.
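To reproduce that, rebuild with optimizations; for example (assuming the source is saved as main.cpp; -std=c++14 or later is needed for the 100'000 digit separators):

g++ -O2 -std=c++14 main.cpp -o bench && ./bench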
Moral: unoptimized code can be really bad, and two pieces of code that could be optimized into something very similar may perform very differently when those optimizations are never attempted.
Upvotes: 3