Lukas Koestler

Reputation: 340

Performance of std::vector::emplace_back vs. assignment for POD struct

In the example below I see a performance benefit of element assignment (given a vector resized to the correct size) over emplace_back (given a vector with reserved storage) for a plain old data (POD) struct. Could someone explain where this difference comes from?

Thank you very much in advance!

Code

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>

using std::cout;
using std::endl;
using std::vector;
using std::size_t;

typedef std::chrono::high_resolution_clock hrc;
typedef std::chrono::microseconds ms;
using std::chrono::duration_cast;

struct Data {
  int x, y;

  inline Data() noexcept: x(0), y(0) {}

  inline Data(int x, int y) noexcept: x(x), y(y) {}
};

int main() {
  constexpr size_t n = 1000000;
  constexpr size_t reps = 5;

  for (size_t rep = 0; rep < reps; rep++) {
    {
      vector<Data> vec;
      vec.reserve(n);
      auto t1 = hrc::now();
      for (size_t i = 0; i < n; i++)
        vec.emplace_back(i, -i);
      auto t2 = hrc::now();
      cout << "Emplace Back: " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;

      // Check
      size_t sum = 0;
      for (auto const &elem : vec)
        sum += elem.x;
      if (sum != ((n * (n - 1)) / 2))
        return EXIT_FAILURE;
    }

    {
      vector<Data> vec;
      vec.resize(n);
      auto t1 = hrc::now();
      for (size_t i = 0; i < n; i++)
        vec[i] = Data(i, i);
      auto t2 = hrc::now();
      cout << "Assign      : " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;

      // Check
      size_t sum = 0;
      for (auto const &elem : vec)
        sum += elem.x;
      if (sum != ((n * (n - 1)) / 2))
        return EXIT_FAILURE;
    }
  }
}

Output

sysctl -n machdep.cpu.brand_string && clang++ -v && clang++ -o main -std=c++17 -O3 main.cpp && ./main
Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Emplace Back: 6162 ms
Assign      : 1000 ms
Emplace Back: 2874 ms
Assign      : 864 ms
Emplace Back: 2149 ms
Assign      : 855 ms
Emplace Back: 2062 ms
Assign      : 934 ms
Emplace Back: 2678 ms
Assign      : 1030 ms

Upvotes: 1

Views: 1153

Answers (2)

rustyx

Reputation: 85351

First of all, two observations:

  1. The index access version is missing a negation of the y argument:
    compare vec.emplace_back(i, -i); vs. vec[i] = Data(i, i);
  2. The times you're printing are microseconds ("µs"); "ms" usually means milliseconds.
    Given that 1,000,000 iterations take 864 µs, one iteration takes about 0.9 ns, or just a couple of CPU cycles. Comparing that to the 2.8 ns of the emplace_back version, we're talking about a difference of a few cycles per iteration. Both fixes are sketched right after this list.
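
A sketch of both fixes against the question's code (only the affected lines are shown; the alias name us is my own):

typedef std::chrono::microseconds us;  // duration_cast already produces microseconds here

// in the assignment loop: negate y so both versions construct the same values
vec[i] = Data(i, -i);

cout << "Assign      : " << duration_cast<us>(t2 - t1).count() << " us" << endl;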

Then some high-level analysis:

The reason the emplace_back version takes longer than assignment via index access could be that emplace_back, in addition to creating a new element, needs to grow the vector by 1. Even when there is enough reserved space, growing the vector involves (1) a check whether there is enough space and (2) an update of the internal vector size field.

Vector index access, on the other hand, performs no bounds checking, let alone any size update. It does little more than a raw pointer dereference.

The element type, struct Data, is very simple. Creating, copying or overwriting it should take negligible time.
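
To make this concrete, here is a rough sketch of what the two operations boil down to in a typical vector implementation (the member names and layout are illustrative, not taken from any particular standard library):

#include <cstddef>
#include <new>
#include <utility>

// Illustrative only; real std::vector implementations differ in details,
// exception safety and allocator handling.
template <class T>
struct toy_vector {
  T* begin_;  // start of the storage
  T* end_;    // one past the last constructed element (defines size())
  T* cap_;    // one past the end of the allocated storage (defines capacity())

  template <class... Args>
  void emplace_back(Args&&... args) {
    if (end_ == cap_) {
      // (1) capacity check on every call; the reallocation that would happen
      //     here is elided, since reserve(n) guarantees this branch is never taken
    }
    ::new (static_cast<void*>(end_)) T(std::forward<Args>(args)...);  // construct in place
    ++end_;  // (2) bump the internal size field
  }

  T& operator[](std::size_t i) {
    return begin_[i];  // plain pointer arithmetic: no check, no size update
  }
};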

Finally, we analyze the generated assembly to know for sure what's really going on:

emplace_back version:

        leaq    8000000(%rax), %r14
        xorl    %ebp, %ebp
        movq    %rbx, %r12
        movq    %rax, 8(%rsp)
        jmp     .L14
.L73:
        movl    %ebp, %eax
        movd    %ebp, %xmm0
        addq    $1, %rbp
        addq    $8, %rbx
        negl    %eax
        movd    %eax, %xmm5
        punpckldq       %xmm5, %xmm0
        movq    %xmm0, -8(%rbx)
        cmpq    $1000000, %rbp
        je      .L72
.L14:
        movq    %r14, %r15
        subq    %r12, %r15
        cmpq    %rbx, %r14
        jne     .L73

Index access version:

        leaq    8000000(%rax), %r13
        . . .
        pxor    %xmm1, %xmm1
        movq    %rax, %r12
        movq    %rbp, %rax
.L27:
        movdqa  .LC3(%rip), %xmm2
        movdqa  %xmm1, %xmm0
        addq    $16, %rax
        paddq   .LC2(%rip), %xmm1
        paddq   %xmm0, %xmm2
        shufps  $136, %xmm2, %xmm0
        movups  %xmm0, -16(%rax)
        cmpq    %rax, %r13
        jne     .L27

Conclusion:

  1. Overall the compiler did a good job of inlining and eliminating object copies in both versions.
  2. The second version is vectorized, writing two elements per iteration (perhaps enabled by the missing negation of y).
  3. The first version does more work - we can see the additional counting (addq $1, %rbp) and checking (cmpq $1000000, %rbp).

Upvotes: 3

t.niese

Reputation: 40842

What you expect to measure

emplace_back
Creates an instance of Data via emplace_back in the preallocated space of vec.

assignment
Creates an instance of Data and assigns it to the Data object that already exists in vec.

What you likely do measure (due to optimization)

emplace_back
Same as above: creates an instance of Data via emplace_back in the preallocated space of vec.
Additionally, it increases the size and checks whether new space has to be allocated.

assignment
Just assigns - in this case, because Data is a really simple object - i to the members x and y of the object already created in the resize step.

So for the assignment case, you completely miss the time required to create the Data objects, because you don't include resize in the measurement.

Further explanation

Your vec.resize(n); in the assign case fills the vector with n elements by default-constructing them (placement new is called n times). The loop then assigns to those already constructed objects.

For the first case (emplace_back), the size (not the capacity) of the vector is increased by one in each iteration, and the lifetime of each added object is started with a placement new. For the second case (assign), the size does not change and you only assign a Data object to the one that is already constructed.

The compiler could optimize the assignment case to just write x and y into the already existing instance in the vector, without creating an intermediate Data object.
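
In other words, after inlining, the compiler is free to reduce the assignment loop to roughly the following (a sketch of the effect of that optimization, reusing the names from the question; it is not actual compiler output):

// No temporary Data object and no size bookkeeping: just stores into memory
// whose objects were already constructed by resize(n).
Data* p = vec.data();
for (size_t i = 0; i < n; i++) {
  p[i].x = static_cast<int>(i);
  p[i].y = static_cast<int>(i);  // no negation, matching vec[i] = Data(i, i)
}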

To measure in a meaningful way you need to include vec.reserve(n) and vec.resize(n) in the timed region. Currently you mostly measure differences in how the compilers optimize the two loops.

If you also include vec.reserve(n) and vec.resize(n), the measurements are much closer together.

For simple classes like Data there won't be much of a difference between assignment and emplace_back, because the intermediate construction in the assignment loop can often be optimized away.
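
For reference, one way the timed regions could be widened so that allocation and default construction are included (a sketch reusing the aliases from the question's code; not necessarily the exact code behind the numbers below):

{
  auto t1 = hrc::now();  // start before the allocation
  vector<Data> vec;
  vec.reserve(n);
  for (size_t i = 0; i < n; i++)
    vec.emplace_back(i, -i);
  auto t2 = hrc::now();
  // label kept as in the question; the values are actually microseconds
  cout << "Emplace Back: " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;
}

{
  auto t1 = hrc::now();  // start before resize(n) default-constructs the n elements
  vector<Data> vec;
  vec.resize(n);
  for (size_t i = 0; i < n; i++)
    vec[i] = Data(i, -i);  // y negated so both loops do the same work
  auto t2 = hrc::now();
  cout << "Assign      : " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;
}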

When you include vec.reserve(n) and vec.resize(n) in the measurement, the timings are:

gcc 8.4

Emplace Back: 11594 ms
Assign      : 11516 ms
Emplace Back: 2001 ms
Assign      : 2691 ms
Emplace Back: 2523 ms
Assign      : 1847 ms
Emplace Back: 1956 ms
Assign      : 1277 ms
Emplace Back: 949 ms
Assign      : 903 ms

clang

Emplace Back: 2115 ms
Assign      : 2640 ms
Emplace Back: 765 ms
Assign      : 766 ms
Emplace Back: 666 ms
Assign      : 540 ms
Emplace Back: 535 ms
Assign      : 515 ms
Emplace Back: 537 ms
Assign      : 543 ms

Upvotes: 2
