Reputation: 12138
I wrote the following code to compare the copy constructor and the move constructor of std::string, and the result surprised me: the move constructor is ~1.4 times slower than the copy constructor.
To my understanding, move construction doesn't need to allocate memory. For std::string, the move-constructed object can simply take over the internal pointer of the moved-from object, which should be faster than allocating a buffer and copying the contents, as copy construction does.
Here is the code:
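#include <string>
#include <iostream>
#include <utility>  // std::move

// Copy-constructs a local string from s, then destroys it.
void CopyContruct(const std::string &s) {
    auto copy = std::string(s);
}

// Move-constructs a local string from s, then destroys it.
void MoveContruct(std::string &&s) {
    auto copy = std::move(s);
    //auto copy = std::string(std::move(s));
}

int main(int argc, const char *argv[]) {
    for (int i = 0; i < 50000000; ++i) {
        // Each call builds a temporary std::string from the literal.
        CopyContruct("hello world");
        //MoveContruct("hello world");
    }
    return 0;
}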
Edit:
From the assembly of the two functions, I can see that MoveContruct contains a call to an instantiation of std::move, whose return type involves the std::remove_reference class template. I think this is the culprit, but I am not familiar with assembly. Can anyone elaborate on that?
The following assembly was generated on https://godbolt.org/ with x86-64 gcc 7.2:
CopyContruct(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&):
push rbp
mov rbp, rsp
sub rsp, 48
mov QWORD PTR [rbp-40], rdi
mov rdx, QWORD PTR [rbp-40]
lea rax, [rbp-32]
mov rsi, rdx
mov rdi, rax
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
lea rax, [rbp-32]
mov rdi, rax
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()
nop
leave
ret
MoveContruct(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&):
push rbp
mov rbp, rsp
sub rsp, 48
mov QWORD PTR [rbp-40], rdi
mov rax, QWORD PTR [rbp-40]
mov rdi, rax
call std::remove_reference<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>::type&& std::move<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
mov rdx, rax
lea rax, [rbp-32]
mov rsi, rdx
mov rdi, rax
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)
lea rax, [rbp-32]
mov rdi, rax
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()
nop
leave
ret
Edit2:
Things are getting interesting. I changed std::string to std::vector, as @FantasticMrFox suggested in the comments, and the result is the opposite: MoveContruct is ~1.9 times faster than CopyContruct. So it seems std::remove_reference is not the culprit, but the way these two classes are optimized may be.
Edit3:
The following assembly was generated on macOS with Apple LLVM version 8.0.0 (clang-800.0.42.1), with optimization flag -O3.
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 11
.globl __Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
.align 4, 0x90
__Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE: ## @_Z12CopyContructRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
pushq %rbx
subq $24, %rsp
Ltmp3:
.cfi_offset %rbx, -24
movq %rdi, %rax
leaq -32(%rbp), %rbx
movq %rbx, %rdi
movq %rax, %rsi
callq __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEC1ERKS5_
movq %rbx, %rdi
callq __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
addq $24, %rsp
popq %rbx
popq %rbp
retq
.cfi_endproc
.globl __Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
.align 4, 0x90
__Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE: ## @_Z12MoveContructONSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp4:
.cfi_def_cfa_offset 16
Ltmp5:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp6:
.cfi_def_cfa_register %rbp
subq $32, %rsp
movq 16(%rdi), %rax
movq %rax, -8(%rbp)
movq (%rdi), %rax
movq 8(%rdi), %rcx
movq %rcx, -16(%rbp)
movq %rax, -24(%rbp)
movq $0, 16(%rdi)
movq $0, 8(%rdi)
movq $0, (%rdi)
leaq -24(%rbp), %rdi
callq __ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
addq $32, %rsp
popq %rbp
retq
.cfi_endproc
Upvotes: 0
Views: 117
Reputation: 275310
When I feed your code to clang or gcc with -O3, I get this from clang:
main: # @main
mov eax, 50000000
.LBB0_1: # =>This Inner Loop Header: Depth=1
add eax, -25
jne .LBB0_1
xor eax, eax
ret
and gcc:
main:
xor eax, eax
ret
I did place the functions in an anonymous namespace to get rid of the noise from having to export the functions themselves. But the main is being completely optimized away.
Microbenchmarks are often misleading.
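One way to make it harder for the optimizer to delete the loop is to give every iteration an observable side effect. The following is only a sketch under assumptions: g_sink, time_ms, copy_once and move_once are hypothetical names that do not appear in the question, and a volatile write is a modest barrier, not a guarantee that the compiler keeps all of the string work. Benchmark libraries offer stronger tools (Google Benchmark's benchmark::DoNotOptimize, for example).

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sink: a volatile write is an observable side effect,
// so the compiler has to perform it on every iteration.
volatile char g_sink = 0;

// Runs body() `iterations` times and returns the elapsed time in milliseconds.
template <typename Body>
double time_ms(std::size_t iterations, Body body) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Builds a fresh source string, copy-constructs from it, and consumes the result.
void copy_once(const std::string& text) {
    std::string source(text);
    std::string copy(source);
    g_sink = copy.c_str()[0];
}

// Builds a fresh source string, move-constructs from it, and consumes the result.
void move_once(const std::string& text) {
    std::string source(text);
    std::string moved(std::move(source));
    g_sink = moved.c_str()[0];
}
```

A call such as time_ms(50000000, [&] { copy_once(text); }) then times the copy path, and the same call with move_once times the move path. Taking text from run-time input (a command-line argument, say) rather than a literal makes constant folding less likely. Both helpers pay for constructing source, so only the difference between the two timings is meaningful.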
Upvotes: 1
Reputation: 71899
This kind of microbenchmark is often misleading, because it doesn't test the thing you think it tests.
However, in your case, I can explain the most likely cause of the measurements you're seeing.
std::string, in all modern implementations, uses something called the "small buffer optimization", or SBO. (@FantasticMrFox's assertion in the comments about using flyweight is wrong. I don't think any popular implementation ever used flyweight except for the empty string. He means copy-on-write, which was used by GNU's standard library in the past, but GNU switched away because a compliant C++11 string cannot use COW.)
In this optimization, some space is reserved internally in the string object to store short strings and avoid heap allocation for them.
This means the copy and move constructors of string are implemented roughly like this:
copy(source) {
    if source length > internal buffer capacity
        allocate space
    copy source buffer to my buffer
}

move(source) {
    if source uses internal buffer {
        copy source buffer to my buffer
        set source length to zero
        set first byte of source buffer to zero
    } else {
        steal source buffer
    }
}
As you can see, the move constructor is a bit more complex. It is also a bit more optimized than that in some implementations, but the general logic stays the same.
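To make the pseudocode concrete, here is a toy class that follows the same logic. It is a hypothetical illustration only: TinyString, kInlineCapacity and the member layout are made up for this sketch, and real std::string implementations are laid out differently and do more.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Toy string with a small internal buffer (illustration, not a real library type).
class TinyString {
public:
    static constexpr std::size_t kInlineCapacity = 15;

    explicit TinyString(const char* text) : size_(std::strlen(text)) {
        data_ = allocate_for(size_);
        std::memcpy(data_, text, size_ + 1);
    }

    // Copy: allocate only if the text does not fit inline, then copy the bytes.
    TinyString(const TinyString& other) : size_(other.size_) {
        data_ = allocate_for(size_);
        std::memcpy(data_, other.data_, size_ + 1);
    }

    // Move: inline text still has to be copied; heap text is stolen.
    // Either way the source is reset to an empty string afterwards.
    TinyString(TinyString&& other) noexcept : size_(other.size_) {
        if (other.is_inline()) {
            data_ = inline_;
            std::memcpy(inline_, other.inline_, size_ + 1);
        } else {
            data_ = other.data_;          // steal the heap buffer
            other.data_ = other.inline_;  // source falls back to its inline buffer
        }
        other.size_ = 0;
        other.inline_[0] = '\0';
    }

    TinyString& operator=(const TinyString&) = delete;
    TinyString& operator=(TinyString&&) = delete;

    ~TinyString() {
        if (!is_inline()) delete[] data_;
    }

    const char* c_str() const { return data_; }
    std::size_t size() const { return size_; }
    bool is_inline() const { return data_ == inline_; }

private:
    // Returns the inline buffer for short text, a new heap buffer otherwise.
    char* allocate_for(std::size_t length) {
        return length <= kInlineCapacity ? inline_ : new char[length + 1];
    }

    std::size_t size_;
    char* data_;
    char inline_[kInlineCapacity + 1];
};
```

For a short string such as "hello world", the move constructor in this sketch performs the same memcpy as the copy constructor plus two extra stores to reset the source. That is the extra work described above.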
So for small buffer strings (and I suspect the one you're testing with fits in your particular implementation), there's simply less work to do to copy, because the source string doesn't need to be reset.
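If you want to check whether a particular string fits the internal buffer on your implementation, a rough test is to see whether the character data lives inside the string object itself. This is a sketch, not a documented API: stored_inline is a hypothetical helper, and it assumes that comparing addresses as integers is meaningful, which holds on mainstream platforms.

```cpp
#include <cstdint>
#include <string>

// Returns true if the character buffer lies within the bytes of the
// std::string object itself, i.e. no separate heap buffer is in use.
bool stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end = object_begin + sizeof(std::string);
    const auto buffer = reinterpret_cast<std::uintptr_t>(s.data());
    return buffer >= object_begin && buffer < object_end;
}
```

Calling stored_inline on a std::string holding "hello world" tells you which branch of the move constructor your benchmark exercises. A string of a few hundred characters cannot fit inside the object, so it should report false.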
But when you turn on full optimizations, the compiler probably recognizes some dead stores and removes them. (Of course, the compiler might just remove your whole benchmark, since it doesn't actually do anything.)
Upvotes: 5