Lukas Koestler

Reputation: 340

Performance of std::vector::emplace_back vs. assignment for POD struct

In the example below I see a performance benefit of element assignment (given a vector resized to the correct size) over emplace_back (given a vector with reserved storage) for a plain old data (POD) struct. Could someone explain where this difference comes from?

Thank you very much in advance!

Code

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>

using std::cout;
using std::endl;
using std::vector;
using std::size_t;

typedef std::chrono::high_resolution_clock hrc;
typedef std::chrono::microseconds ms;
using std::chrono::duration_cast;

struct Data {
  int x, y;

  inline Data() noexcept: x(0), y(0) {}

  inline Data(int x, int y) noexcept: x(x), y(y) {}
};

int main() {
  constexpr size_t n = 1000000;
  constexpr size_t reps = 5;

  for (size_t rep = 0; rep < reps; rep++) {
    {
      vector<Data> vec;
      vec.reserve(n);
      auto t1 = hrc::now();
      for (size_t i = 0; i < n; i++)
        vec.emplace_back(i, -i);
      auto t2 = hrc::now();
      cout << "Emplace Back: " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;

      // Check
      size_t sum = 0;
      for (auto const &elem : vec)
        sum += elem.x;
      if (sum != ((n * (n - 1)) / 2))
        return EXIT_FAILURE;
    }

    {
      vector<Data> vec;
      vec.resize(n);
      auto t1 = hrc::now();
      for (size_t i = 0; i < n; i++)
        vec[i] = Data(i, i);
      auto t2 = hrc::now();
      cout << "Assign      : " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;

      // Check
      size_t sum = 0;
      for (auto const &elem : vec)
        sum += elem.x;
      if (sum != ((n * (n - 1)) / 2))
        return EXIT_FAILURE;
    }
  }
}

Output

sysctl -n machdep.cpu.brand_string && clang++ -v && clang++ -o main -std=c++17 -O3 main.cpp && ./main
Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Emplace Back: 6162 ms
Assign      : 1000 ms
Emplace Back: 2874 ms
Assign      : 864 ms
Emplace Back: 2149 ms
Assign      : 855 ms
Emplace Back: 2062 ms
Assign      : 934 ms
Emplace Back: 2678 ms
Assign      : 1030 ms

Upvotes: 1

Views: 1153

Answers (2)

rustyx

Reputation: 85351

First of all, two observations:

  1. The index access version is missing a negation of the y argument:
    compare vec.emplace_back(i, -i); vs. vec[i] = Data(i, i);
  2. The times you're printing are microseconds ("µs"); "ms" usually means milliseconds.
    Given that 1,000,000 iterations take 864 µs, one iteration takes about 0.9 ns, or just a couple of CPU cycles. Comparing that to the 2.8 ns of the emplace_back version, we're talking about a difference of a few cycles per iteration. Both fixes are sketched right after this list.
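
A sketch of both fixes against the question's code (only the affected lines are shown; the alias name us is my own):

typedef std::chrono::microseconds us;  // duration_cast already produces microseconds here

// in the assignment loop: negate y so both versions construct the same values
vec[i] = Data(i, -i);

cout << "Assign      : " << duration_cast<us>(t2 - t1).count() << " us" << endl;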

Then some high-level analysis:

The reason the emplace_back version takes longer than assignment via index access could be that emplace_back, in addition to creating a new element, needs to grow the vector by 1. Even when there is enough reserved space, growing the vector involves (1) a check whether there is enough space and (2) an update of the internal vector size field.

Vector index access, on the other hand, performs no bounds checking, let alone any size update. It does little more than a raw pointer dereference.

The element type, struct Data, is very simple. Creating, copying or overwriting it should take negligible time.
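
To make this concrete, here is a rough sketch of what the two operations boil down to in a typical vector implementation (the member names and layout are illustrative, not taken from any particular standard library):

#include <cstddef>
#include <new>
#include <utility>

// Illustrative only; real std::vector implementations differ in details,
// exception safety and allocator handling.
template <class T>
struct toy_vector {
  T* begin_;  // start of the storage
  T* end_;    // one past the last constructed element (defines size())
  T* cap_;    // one past the end of the allocated storage (defines capacity())

  template <class... Args>
  void emplace_back(Args&&... args) {
    if (end_ == cap_) {
      // (1) capacity check on every call; the reallocation that would happen
      //     here is elided, since reserve(n) guarantees this branch is never taken
    }
    ::new (static_cast<void*>(end_)) T(std::forward<Args>(args)...);  // construct in place
    ++end_;  // (2) bump the internal size field
  }

  T& operator[](std::size_t i) {
    return begin_[i];  // plain pointer arithmetic: no check, no size update
  }
};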

Finally, we analyze the generated assembly to know for sure what's really going on:

emplace_back version:

        leaq    8000000(%rax), %r14
        xorl    %ebp, %ebp
        movq    %rbx, %r12
        movq    %rax, 8(%rsp)
        jmp     .L14
.L73:
        movl    %ebp, %eax
        movd    %ebp, %xmm0
        addq    $1, %rbp
        addq    $8, %rbx
        negl    %eax
        movd    %eax, %xmm5
        punpckldq       %xmm5, %xmm0
        movq    %xmm0, -8(%rbx)
        cmpq    $1000000, %rbp
        je      .L72
.L14:
        movq    %r14, %r15
        subq    %r12, %r15
        cmpq    %rbx, %r14
        jne     .L73

Index access version:

        leaq    8000000(%rax), %r13
        . . .
        pxor    %xmm1, %xmm1
        movq    %rax, %r12
        movq    %rbp, %rax
.L27:
        movdqa  .LC3(%rip), %xmm2
        movdqa  %xmm1, %xmm0
        addq    $16, %rax
        paddq   .LC2(%rip), %xmm1
        paddq   %xmm0, %xmm2
        shufps  $136, %xmm2, %xmm0
        movups  %xmm0, -16(%rax)
        cmpq    %rax, %r13
        jne     .L27

Conclusion:

  1. Overall the compiler did a good job of inlining and eliminating object copies in both versions.
  2. The second version is vectorized, writing two elements per iteration (perhaps enabled by the missing negation of y).
  3. The first version does more work - we can see the additional counting (addq $1, %rbp) and checking (cmpq $1000000, %rbp).

Upvotes: 3

t.niese

Reputation: 40842

What you expect to measure

emplace_back
Creates an instance of Data via emplace_back in the preallocated space of vec.

assignment
Creates an instance of Data and assigns it to the Data object that already exists in vec.

What you likely do measure (due to optimization)

emplace_back
Same as above: creates an instance of Data via emplace_back in the preallocated space of vec.
Additionally, it increases the size and checks whether new space has to be allocated.

assignment
Just assigns - in this case, because Data is a really simple object - i to the members x and y of the object already created in the resize step.

So for the assignment case, you completely miss the time required to create the Data objects, because you don't include resize in the measurement.

Further explanation

Your vec.resize(n); in the assign case fills the vector with n elements by default-constructing them (placement new is called n times). The loop then assigns to those already constructed objects.

For the first case (emplace_back), the size (not the capacity) of the vector is increased by one in each iteration, and the lifetime of each added object is started with a placement new. For the second case (assign), the size does not change and you only assign a Data object to the one that is already constructed.

The compiler could optimize the assignment case to just write x and y into the already existing instance in the vector, without creating an intermediate Data object.
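
In other words, after inlining, the compiler is free to reduce the assignment loop to roughly the following (a sketch of the effect of that optimization, reusing the names from the question; it is not actual compiler output):

// No temporary Data object and no size bookkeeping: just stores into memory
// whose objects were already constructed by resize(n).
Data* p = vec.data();
for (size_t i = 0; i < n; i++) {
  p[i].x = static_cast<int>(i);
  p[i].y = static_cast<int>(i);  // no negation, matching vec[i] = Data(i, i)
}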

To measure in a meaningful way you need to include vec.reserve(n) and vec.resize(n) in the timed region. Currently you mostly measure differences in how the compilers optimize the two loops.

If you also include vec.reserve(n) and vec.resize(n), the measurements are much closer together.

For simple classes like Data there won't be much of a difference between assignment and emplace_back, because the intermediate construction in the assignment loop can often be optimized away.
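
For reference, one way the timed regions could be widened so that allocation and default construction are included (a sketch reusing the aliases from the question's code; not necessarily the exact code behind the numbers below):

{
  auto t1 = hrc::now();  // start before the allocation
  vector<Data> vec;
  vec.reserve(n);
  for (size_t i = 0; i < n; i++)
    vec.emplace_back(i, -i);
  auto t2 = hrc::now();
  // label kept as in the question; the values are actually microseconds
  cout << "Emplace Back: " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;
}

{
  auto t1 = hrc::now();  // start before resize(n) default-constructs the n elements
  vector<Data> vec;
  vec.resize(n);
  for (size_t i = 0; i < n; i++)
    vec[i] = Data(i, -i);  // y negated so both loops do the same work
  auto t2 = hrc::now();
  cout << "Assign      : " << duration_cast<ms>(t2 - t1).count() << " ms" << endl;
}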

When you include vec.reserve(n) and vec.resize(n) in the measurement, the timings are:

gcc 8.4

Emplace Back: 11594 ms
Assign      : 11516 ms
Emplace Back: 2001 ms
Assign      : 2691 ms
Emplace Back: 2523 ms
Assign      : 1847 ms
Emplace Back: 1956 ms
Assign      : 1277 ms
Emplace Back: 949 ms
Assign      : 903 ms

clang

Emplace Back: 2115 ms
Assign      : 2640 ms
Emplace Back: 765 ms
Assign      : 766 ms
Emplace Back: 666 ms
Assign      : 540 ms
Emplace Back: 535 ms
Assign      : 515 ms
Emplace Back: 537 ms
Assign      : 543 ms

Upvotes: 2
