Why is move ctor slower than copy ctor?

Question

I have the following code to test the copy ctor and move ctor of the std::string class, and the result surprised me: the move ctor is ~1.4 times slower than the copy ctor.

To my understanding, move construction doesn't need to allocate memory. For std::string, the move-constructed object can simply take over the internal pointer of the moved-from object, which should be faster than allocating a buffer and copying the contents into it, as copy construction does.

Here is the code:

#include <string>
#include <utility>

void CopyContruct(const std::string &s) {
  auto copy = std::string(s);
}

void MoveContruct(std::string &&s) {
  auto copy = std::move(s);
  //auto copy = std::string(std::move(s));
}

int main(int argc, const char *argv[]) {
  for (int i = 0; i < 50000000; ++i) {
    CopyContruct("hello world");
    //MoveContruct("hello world");
  }

  return 0;
}

Edit:

From the assembly of the two functions, I can see that MoveContruct involves an instantiation of the std::remove_reference class template. I think this is the culprit, but I am not familiar with assembly. Can anyone elaborate on that?

The following assembly was generated on https://godbolt.org/ with x86-64 gcc 7.2:

CopyContruct(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&):
  push rbp
  mov rbp, rsp
  sub rsp, 48
  mov QWORD PTR [rbp-40], rdi
  mov rdx, QWORD PTR [rbp-40]
  lea rax, [rbp-32]
  mov rsi, rdx
  mov rdi, rax
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  lea rax, [rbp-32]
  mov rdi, rax
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()
  nop
  leave
  ret
MoveContruct(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&):
  push rbp
  mov rbp, rsp
  sub rsp, 48
  mov QWORD PTR [rbp-40], rdi
  mov rax, QWORD PTR [rbp-40]
  mov rdi, rax
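  call std::remove_reference<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>::type&& std::move<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)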
  mov rdx, rax
  lea rax, [rbp-32]
  mov rsi, rdx
  mov rdi, rax
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)
  lea rax, [rbp-32]
  mov rdi, rax
  call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()
  nop
  leave
  ret

Edit2:

Things are getting interesting. I changed std::string to std::vector, as @FantasticMrFox suggested in the comments, and the result is the opposite: MoveContruct is ~1.9 times faster than CopyContruct. So std::remove_reference does not seem to be the culprit, but the optimizations in these two classes may be.

Edit3:

The following assembly was generated on macOS with Apple LLVM version 8.0.0 (clang-800.0.42.1) and the optimization flag -O3.

    .section    __TEXT,__text,regular,pure_instructions
    .macosx_version_min 10, 11
    .globl  __Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
    .align  4, 0x90
__Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE: ## @_Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
    .cfi_startproc
## BB#0:
    pushq   %rbp
Ltmp0:
    .cfi_def_cfa_offset 16
Ltmp1:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp2:
    .cfi_def_cfa_register %rbp
    pushq   %rbx
    subq    $24, %rsp
Ltmp3:
    .cfi_offset %rbx, -24
    movq    %rdi, %rax
    leaq    -32(%rbp), %rbx
    movq    %rbx, %rdi
    movq    %rax, %rsi
    callq   __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEC1ERKS5_
    movq    %rbx, %rdi
    callq   __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
    addq    $24, %rsp
    popq    %rbx
    popq    %rbp
    retq
    .cfi_endproc

    .globl  __Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
    .align  4, 0x90
__Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE: ## @_Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
    .cfi_startproc
## BB#0:
    pushq   %rbp
Ltmp4:
    .cfi_def_cfa_offset 16
Ltmp5:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp6:
    .cfi_def_cfa_register %rbp
    subq    $32, %rsp
    movq    16(%rdi), %rax
    movq    %rax, -8(%rbp)
    movq    (%rdi), %rax
    movq    8(%rdi), %rcx
    movq    %rcx, -16(%rbp)
    movq    %rax, -24(%rbp)
    movq    $0, 16(%rdi)
    movq    $0, 8(%rdi)
    movq    $0, (%rdi)
    leaq    -24(%rbp), %rdi
    callq   __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
    addq    $32, %rsp
    popq    %rbp
    retq
    .cfi_endproc

Sebastian Redl · Accepted Answer

This kind of microbenchmark is often misleading, because it doesn't test the thing you think it tests.

However, in your case, I can explain the most likely cause of the measurements you're seeing.

std::string, in all modern implementations, uses something called the "small buffer optimization", or SBO. (@FantasticMrFox's assertion in the comments about using flyweight is wrong. I don't think any popular implementation ever used flyweight except for the empty string. He means copy-on-write, which was used by GNU's standard library in the past, but GNU switched away because a compliant C++11 string cannot use COW.)

In this optimization, some space is reserved internally in the string object to store short strings and avoid heap allocation for them.
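One rough way to observe this is to check whether the character data lives inside the string object itself. This is only a sketch: the helper name `uses_internal_buffer` is made up for illustration, and the check relies on how common implementations lay out the object, which the standard does not specify.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper: returns true if s.data() points into the bytes of the
// std::string object itself, i.e. the characters are not in a heap allocation.
// Assumption: the implementation keeps its small buffer inside the object.
bool uses_internal_buffer(const std::string& s) {
    const char* begin = reinterpret_cast<const char*>(&s);
    const char* end = begin + sizeof(std::string);
    std::less<const char*> before;  // gives a total order over unrelated pointers
    return !before(s.data(), begin) && before(s.data(), end);
}
```

With the 11-character "hello world" from the question, this is expected to return true on current libstdc++, libc++ and MSVC builds. For a string of, say, 100 characters it returns false, because that many characters cannot fit inside the object.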

This means the copy and move constructors of string are implemented roughly like this:

copy(source) {
  if source length > internal buffer capacity
    allocate space
  copy source buffer to my buffer
}

move(source) {
  if source uses internal buffer {
    copy source buffer to my buffer
    set source length to zero
    set first byte of source buffer to zero
  } else {
    steal source buffer
  }
}
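The pseudocode above can be turned into a compilable toy class. This is a minimal sketch, not the layout of any real standard library: the name `TinyString`, the 15-character capacity `kInline` and all member names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical toy string with an internal buffer (not a real library class).
class TinyString {
public:
    static constexpr std::size_t kInline = 15;  // assumed internal capacity

    explicit TinyString(const char* text) : len_(std::strlen(text)) {
        data_ = (len_ > kInline) ? new char[len_ + 1] : buf_;
        std::memcpy(data_, text, len_ + 1);
    }

    // Copy: allocate only if the source does not fit internally, then copy.
    TinyString(const TinyString& other) : len_(other.len_) {
        data_ = (len_ > kInline) ? new char[len_ + 1] : buf_;
        std::memcpy(data_, other.data_, len_ + 1);
    }

    // Move: copy the internal buffer, or steal the heap buffer.
    // Either way the source must be reset to a valid empty string.
    TinyString(TinyString&& other) noexcept : len_(other.len_) {
        if (other.is_inline()) {
            data_ = buf_;
            std::memcpy(buf_, other.buf_, len_ + 1);
        } else {
            data_ = other.data_;       // steal source buffer
            other.data_ = other.buf_;  // source falls back to its own buffer
        }
        other.len_ = 0;        // set source length to zero
        other.buf_[0] = '\0';  // set first byte of source buffer to zero
    }

    TinyString& operator=(const TinyString&) = delete;
    TinyString& operator=(TinyString&&) = delete;

    ~TinyString() {
        if (!is_inline()) delete[] data_;
    }

    bool is_inline() const { return data_ == buf_; }
    std::size_t size() const { return len_; }
    const char* c_str() const { return data_; }

private:
    char* data_;
    std::size_t len_;
    char buf_[kInline + 1];
};
```

For a short string such as "hello world", the move constructor does the same memcpy as the copy constructor and then performs the extra stores that reset the source. Only for strings longer than `kInline` does the move skip the allocation and the copy.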

As you can see, the move constructor is a bit more complex. It is also a bit more optimized than that in some implementations, but the general logic stays the same.

So for small buffer strings (and I suspect the one you're testing with fits in your particular implementation), there's simply less work to do to copy, because the source string doesn't need to be reset.

But when you turn on full optimizations, the compiler probably recognizes some dead stores and removes them. (Of course, the compiler might just remove your whole benchmark, since it doesn't actually do anything.)
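If you want to repeat the measurement, a sketch like the following reduces the risk of the optimizer deleting the work. All names here (`g_sink`, `time_ns`, `bench_copy`, `bench_move`) are hypothetical. Each iteration builds a fresh source string, so both variants pay the same setup cost and differ only in the copy or the move. Writing the size to a volatile keeps a visible side effect, though it is no guarantee that nothing gets optimized away.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sink: a volatile write the compiler is not allowed to drop.
volatile std::size_t g_sink = 0;

// Runs body() the given number of times and returns elapsed nanoseconds.
template <typename F>
long long time_ns(F body, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Copy-constructs from a fresh source string on every iteration.
long long bench_copy(const std::string& text, int iterations) {
    return time_ns([&] {
        std::string source(text);
        std::string copy(source);
        g_sink = g_sink + copy.size();
    }, iterations);
}

// Move-constructs from a fresh source string on every iteration.
long long bench_move(const std::string& text, int iterations) {
    return time_ns([&] {
        std::string source(text);
        std::string moved(std::move(source));
        g_sink = g_sink + moved.size();
    }, iterations);
}
```

Running both functions with a short string and again with one longer than the internal buffer should show the two regimes described above, but treat the numbers as indicative only and look at the generated assembly before drawing conclusions.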
