user14145203
user14145203

Reputation:

C++ large deque - program takes very long time to exit?

Consider the following C++ program:

#include <deque>
#include <iostream>

using namespace std;

int main()
{
    deque<double> d(30000000);
    cout << "Done\n";
}

The memory allocation in the first line only takes a second, but after it prints Done, it takes 33 seconds (!) to exit back to the terminal. Decreasing the number of elements to 20000000 reduces that time to 22 seconds, so clearly it's linear in the number of elements.

I am compiling on Windows 10, and the same thing happened with both GCC 10.2.0 and Visual Studio 2019.

What's going on here? Am I using deque in a way it's not supposed to be used?

EDIT:

#include <deque>
#include <iostream>

using namespace std;

void test_deque()
{
    deque<double> d(30000000);
    cout << "Function done\n";
}

int main()
{
    test_deque();
    cout << "Main done\n";
}

Now it prints Function done and then there is the 33 second delay. So I assume this has to do with the destructor that gets executed when the function exits. But why does it take so long to destruct 240 MB of memory?

EDIT 2: Tried it (the second version) with GCC on Ubuntu and it only takes a fraction of a second to run! Same with some online C++ compilers. Is this a problem specific to Windows?

EDIT 3: With vector it also takes a fraction of a second to run. However, with list (and forward_list) I get a similar extremely long delay.

EDIT 4: Compiling with MSVC in Release (rather than Debug) configuration also takes a fraction of a second. I'm not sure what the GCC equivalent it, but with -O3 (max optimizations) the execution time remains 33 seconds.

Upvotes: 5

Views: 658

Answers (4)

rustyx
rustyx

Reputation: 85286

MSVC has a built-in profiler. We can run it (press Alt-F2) to see that the majority of CPU time is spent inside the constructor and destructor, which invoke deque::resize() and deque::_Tidy() functions, respectively.

msvc profiler

If we drill down further we see that deque::emplace_back() is resulting in quite a lot of code

#define _PUSH_BACK_BEGIN                                                                                \
    if ((_Myoff() + _Mysize()) % _DEQUESIZ == 0 && _Mapsize() <= (_Mysize() + _DEQUESIZ) / _DEQUESIZ) { \
        _Growmap(1);                                                                                    \
    }                                                                                                   \
    _Myoff() &= _Mapsize() * _DEQUESIZ - 1;                                                             \
    size_type _Newoff = _Myoff() + _Mysize();                                                           \
    size_type _Block  = _Getblock(_Newoff);                                                             \
    if (_Map()[_Block] == nullptr) {                                                                    \
        _Map()[_Block] = _Getal().allocate(_DEQUESIZ);                                                  \
    }

#define _PUSH_BACK_END ++_Mysize()

    template <class... _Valty>
    decltype(auto) emplace_back(_Valty&&... _Val) {
        _Orphan_all();
        _PUSH_BACK_BEGIN;
        _Alty_traits::construct(
            _Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
        _PUSH_BACK_END;

#if _HAS_CXX17
        return back();
#endif // _HAS_CXX17
    }

Disassembly view:

    template <class... _Valty>
    decltype(auto) emplace_back(_Valty&&... _Val) {
00007FF674A238E0  mov         qword ptr [rsp+8],rcx  
00007FF674A238E5  push        rbp  
00007FF674A238E6  push        rdi  
00007FF674A238E7  sub         rsp,138h  
00007FF674A238EE  lea         rbp,[rsp+20h]  
00007FF674A238F3  mov         rdi,rsp  
00007FF674A238F6  mov         ecx,4Eh  
00007FF674A238FB  mov         eax,0CCCCCCCCh  
00007FF674A23900  rep stos    dword ptr [rdi]  
00007FF674A23902  mov         rcx,qword ptr [rsp+158h]  
00007FF674A2390A  lea         rcx,[__0657B1E2_deque (07FF674A3E02Fh)]  
00007FF674A23911  call        __CheckForDebuggerJustMyCode (07FF674A21159h)  
        _Orphan_all();
00007FF674A23916  mov         rcx,qword ptr [this]  
00007FF674A2391D  call        std::deque<double,std::allocator<double> >::_Orphan_all (07FF674A217FDh)  
        _PUSH_BACK_BEGIN;
00007FF674A23922  mov         rcx,qword ptr [this]  
00007FF674A23929  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A2392E  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23935  mov         rcx,qword ptr [this]  
00007FF674A2393C  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23941  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23948  mov         rcx,qword ptr [rcx]  
00007FF674A2394B  add         rcx,qword ptr [rax]  
00007FF674A2394E  mov         rax,rcx  
00007FF674A23951  xor         edx,edx  
00007FF674A23953  mov         ecx,2  
00007FF674A23958  div         rax,rcx  
00007FF674A2395B  mov         rax,rdx  
00007FF674A2395E  test        rax,rax  
00007FF674A23961  jne         std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)  
00007FF674A23963  mov         rcx,qword ptr [this]  
00007FF674A2396A  call        std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)  
00007FF674A2396F  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23976  mov         rcx,qword ptr [this]  
00007FF674A2397D  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23982  mov         rax,qword ptr [rax]  
00007FF674A23985  add         rax,2  
00007FF674A23989  xor         edx,edx  
00007FF674A2398B  mov         ecx,2  
00007FF674A23990  div         rax,rcx  
00007FF674A23993  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A2399A  cmp         qword ptr [rcx],rax  
00007FF674A2399D  ja          std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)  
00007FF674A2399F  mov         edx,1  
00007FF674A239A4  mov         rcx,qword ptr [this]  
00007FF674A239AB  call        std::deque<double,std::allocator<double> >::_Growmap (07FF674A21640h)  
00007FF674A239B0  mov         rcx,qword ptr [this]  
00007FF674A239B7  call        std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)  
00007FF674A239BC  mov         rax,qword ptr [rax]  
00007FF674A239BF  lea         rax,[rax+rax-1]  
00007FF674A239C4  mov         qword ptr [rbp+0F8h],rax  
00007FF674A239CB  mov         rcx,qword ptr [this]  
00007FF674A239D2  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A239D7  mov         qword ptr [rbp+100h],rax  
00007FF674A239DE  mov         rax,qword ptr [rbp+100h]  
00007FF674A239E5  mov         rax,qword ptr [rax]  
00007FF674A239E8  mov         qword ptr [rbp+108h],rax  
00007FF674A239EF  mov         rax,qword ptr [rbp+0F8h]  
00007FF674A239F6  mov         rcx,qword ptr [rbp+108h]  
00007FF674A239FD  and         rcx,rax  
00007FF674A23A00  mov         rax,rcx  
00007FF674A23A03  mov         rcx,qword ptr [rbp+100h]  
00007FF674A23A0A  mov         qword ptr [rcx],rax  
00007FF674A23A0D  mov         rcx,qword ptr [this]  
00007FF674A23A14  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A23A19  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23A20  mov         rcx,qword ptr [this]  
00007FF674A23A27  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23A2C  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23A33  mov         rcx,qword ptr [rcx]  
00007FF674A23A36  add         rcx,qword ptr [rax]  
00007FF674A23A39  mov         rax,rcx  
00007FF674A23A3C  mov         qword ptr [_Newoff],rax  
00007FF674A23A40  mov         rdx,qword ptr [_Newoff]  
00007FF674A23A44  mov         rcx,qword ptr [this]  
00007FF674A23A4B  call        std::deque<double,std::allocator<double> >::_Getblock (07FF674A21334h)  
00007FF674A23A50  mov         qword ptr [_Block],rax  
00007FF674A23A54  mov         rcx,qword ptr [this]  
00007FF674A23A5B  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23A60  mov         rax,qword ptr [rax]  
00007FF674A23A63  mov         rcx,qword ptr [_Block]  
00007FF674A23A67  cmp         qword ptr [rax+rcx*8],0  
00007FF674A23A6C  jne         std::deque<double,std::allocator<double> >::emplace_back<>+1D7h (07FF674A23AB7h)  
00007FF674A23A6E  mov         rcx,qword ptr [this]  
00007FF674A23A75  call        std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)  
00007FF674A23A7A  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23A81  mov         edx,2  
00007FF674A23A86  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23A8D  call        std::allocator<double>::allocate (07FF674A216C7h)  
00007FF674A23A92  mov         qword ptr [rbp+100h],rax  
00007FF674A23A99  mov         rcx,qword ptr [this]  
00007FF674A23AA0  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23AA5  mov         rax,qword ptr [rax]  
00007FF674A23AA8  mov         rcx,qword ptr [_Block]  
00007FF674A23AAC  mov         rdx,qword ptr [rbp+100h]  
00007FF674A23AB3  mov         qword ptr [rax+rcx*8],rdx  
        _Alty_traits::construct(
00007FF674A23AB7  mov         rcx,qword ptr [this]  
00007FF674A23ABE  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23AC3  mov         rax,qword ptr [rax]  
00007FF674A23AC6  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23ACD  xor         edx,edx  
00007FF674A23ACF  mov         rax,qword ptr [_Newoff]  
00007FF674A23AD3  mov         ecx,2  
00007FF674A23AD8  div         rax,rcx  
00007FF674A23ADB  mov         rax,rdx  
00007FF674A23ADE  mov         rcx,qword ptr [_Block]  
00007FF674A23AE2  mov         rdx,qword ptr [rbp+0F8h]  
00007FF674A23AE9  mov         rcx,qword ptr [rdx+rcx*8]  
00007FF674A23AED  lea         rax,[rcx+rax*8]  
00007FF674A23AF1  mov         rcx,rax  
00007FF674A23AF4  call        std::_Unfancy<double> (07FF674A214A6h)  
00007FF674A23AF9  mov         qword ptr [rbp+100h],rax  
00007FF674A23B00  mov         rcx,qword ptr [this]  
00007FF674A23B07  call        std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)  
00007FF674A23B0C  mov         qword ptr [rbp+108h],rax  
00007FF674A23B13  mov         rdx,qword ptr [rbp+100h]  
00007FF674A23B1A  mov         rcx,qword ptr [rbp+108h]  
00007FF674A23B21  call        std::_Default_allocator_traits<std::allocator<double> >::construct<double> (07FF674A211E5h)  
            _Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
        _PUSH_BACK_END;
00007FF674A23B26  mov         rcx,qword ptr [this]  
00007FF674A23B2D  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23B32  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23B39  mov         rax,qword ptr [rbp+0F8h]  
00007FF674A23B40  mov         rax,qword ptr [rax]  
00007FF674A23B43  inc         rax  
00007FF674A23B46  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23B4D  mov         qword ptr [rcx],rax  

#if _HAS_CXX17
        return back();
00007FF674A23B50  mov         rcx,qword ptr [this]  
00007FF674A23B57  call        std::deque<double,std::allocator<double> >::back (07FF674A2127Bh)  
#endif // _HAS_CXX17
    }
00007FF674A23B5C  lea         rsp,[rbp+118h]  
00007FF674A23B63  pop         rdi  
00007FF674A23B64  pop         rbp  
00007FF674A23B65  ret

Apparently std::deque doesn't pre-allocate elements, and instead uses a loop to add them one-by-one. So no wonder it is slow.

You can speed up the Debug build by enabling some optimizations (e.g. /Ob1) and reducing runtime checks (e.g. remove /RTC1).

But really, std::deque is just a horrible structure from a performance point of view (a vector of tiny vectors - not cache-friendly at all).

Upvotes: 0

ALX23z
ALX23z

Reputation: 4703

std::deque allocates data in chunks of fixed size varying with platform and type. For double that could be 4KB. So allocating 30,000,000 doubles takes 2.4GB of memory and thus 6,000 allocations/deallocations. With std::list that would be 30,000,000 allocations/deallocations and take a couple GB more memory making it all ridiculously slow.

This might even cause memory fragmentation issues depending on your hardware. And if you run without optimizations it will be even slower.

There is also the privacy issue. Deallocation might be clearing data to ensure that your program didn't leak any information to outside programs.

As mentioned by @orlp as your program is a no-op the whole allocation/deallocation can be optimized out completely which might explain why it speeds up significantly at times.

Upvotes: 0

orlp
orlp

Reputation: 117661

Fundamentally the answer isn't very interesting. Your program is a no-op, so a compiler may optimize out the deque construction. But it doesn't have to.

But first, a legal sane implementation may do any of the following:

  1. Do an allocation for 30000000 float elements and nothing else. The allocator might:

    1. Do the allocation in the most lazy way, doing essentially nothing but some bookkeeping.
    2. Eagerly allocate and page in memory, causing 30000000/page size operations.
    3. Zero-initialize or pattern-initialize (e.g. 0xdeadbeef) to help detect uninitialized usage the memory, causing 30000000 writes.
  2. Allocate (include above) and zero-initialize or pattern-initialize the memory.

  3. Run some sort of destructor over all elements (e.g. zeroing out memory).

  4. Not run a destructor on any elements since it's a built-in type.

Now all of the above are possible options. And since your program is a no-op, a legal compiler may optimize out any or none of these steps. Your system allocator might vary in capabilities, supporting lazy allocation, overcommit, automatic zeroing, etc. So the end result is that you could get any kind of behavior depending on your operating system, compiler version, compiler flags, standard library, etc.

Upvotes: 2

lilucpp
lilucpp

Reputation: 306

It is really slow in debug mode.

MSDN:

Processes that the debugger creates (also known as spawned processes) behave a little differently than processes that the debugger does not create.

Instead of using the standard heap API, processes that the debugger creates use a special debug heap. You can force a spawned process to use the standard heap instead of the debug heap by using the _NO_DEBUG_HEAP environment variable.

Upvotes: 0

Related Questions