Tabber33

Reputation: 565

What the heque is going on with the memory overhead of std::deque?

I am working on an external sorting algorithm that uses std::queue and must carefully constrain its memory usage. I have noticed that during the merge phase (which uses several std::queues of fixed length), my memory usage increases to about 2.5X what I expected. Since std::queue by default uses std::deque as its underlying container, I ran some tests on std::deque to determine its memory overhead. Here are the results, running on VC++ 9, in release mode, with a 64-bit process:

When adding 100,000,000 chars to a std::deque, the memory usage grows to 252,216K. Note that 100M chars (1 byte) should occupy 97,656K, so this is an overhead of 154,560K.

I repeated the test with doubles (8 bytes) and saw memory grow to 1,976,676K, while 100M doubles should occupy 781,250K, for an overhead of 1,195,426K!!

Now I understand that std::deque is normally implemented as a linked list of "chunks." If this is true, then why is the overhead proportional to the element size (because of course the pointer size should be fixed at 8 bytes)? And why is it so danged huge?

Can anybody shed some light on why std::deque uses so much danged memory? I'm thinking I should switch the underlying container of my std::queues to std::vector, since it has no per-element overhead (provided the appropriate size is reserved up front). I'm thinking the benefits of std::deque are largely negated by such a huge overhead (resulting in cache misses, page faults, etc.), and that the cost of copying std::vector elements may be worth it, given that the overall memory usage is so much lower. Is this just a bad implementation of std::deque by Microsoft?
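
For reference, swapping the underlying container just means supplying std::queue's second template parameter. Here is a minimal sketch of what that looks like; note that the adapter calls pop_front() on its container, which std::vector does not provide, so a vector-backed queue stops compiling the moment pop() is used:

#include <deque>
#include <queue>
#include <vector>

int main()
{
    // std::queue's second template parameter selects the backing container.
    // The adapter needs front(), back(), push_back() and pop_front().
    std::queue<char> q_default;                       // backed by std::deque<char>
    std::queue<char, std::deque<char> > q_explicit;   // same thing, spelled out

    // A vector-backed queue declares fine, but the first call to pop() will
    // not compile, because std::vector has no pop_front():
    // std::queue<char, std::vector<char> > q_vec;
    // q_vec.pop();   // error

    q_default.push('a');
    q_explicit.push('b');
    return 0;
}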

Upvotes: 18

Views: 5153

Answers (3)

Steve Townsend

Reputation: 54178

Is it possible that you are running Debug binaries? 252MB for 100M chars does seem like a lot...

You can check where the memory is going with umdh: take a snapshot before and after, then compare the two. That might shed some light on why it's larger than you expected.

EDIT: FYI - When I run this outside the debugger on VS2010 I get 181MB with chars.

#include <cstddef>
#include <deque>

int main()
{
    std::deque<char> mydequeue;
    for (std::size_t i = 0; i < 100 * 1024 * 1024; ++i)   // 104,857,600 chars
        mydequeue.push_back(char(i));
}
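
For reference, one programmatic way to read the footprint right after a loop like the one above is PSAPI's GetProcessMemoryInfo. This is only a sketch of one possible measurement; the 181MB figure above may just as well have come from Task Manager:

#include <windows.h>
#include <psapi.h>
#include <cstdio>

#pragma comment(lib, "psapi.lib")

void report_memory()
{
    PROCESS_MEMORY_COUNTERS pmc;
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
    {
        // WorkingSetSize is roughly what Task Manager shows as the working set;
        // PagefileUsage is the committed private bytes.
        std::printf("working set: %Iu K, commit: %Iu K\n",
                    pmc.WorkingSetSize / 1024, pmc.PagefileUsage / 1024);
    }
}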

EDIT: Supporting the other answer from @Dialecticus, this gives me the same footprint as double:

// 16-byte element: larger than 8 bytes, so MSVC's deque stores just one per block
struct twoInt64s
{
public:
    twoInt64s(__int64 _a, __int64 _b) : a(_a), b(_b) {}

    __int64 a;
    __int64 b;
};

EDIT: With _DEQUESIZ modified as shown (128 chars per block), 100M chars now takes 113M of memory.

My conclusion is that the remaining overhead you saw comes from the management structures for the deque blocks (which, with the stock settings, hold only 16 chars of data each), plus the deque's own control info, plus the heap manager's per-allocation bookkeeping.

#define _DEQUESIZ   (sizeof (value_type) <= 1 ? 128 \
    : sizeof (value_type) <= 2 ? 8 \
    : sizeof (value_type) <= 4 ? 4 \
    : sizeof (value_type) <= 8 ? 2 \
    : 1)    /* elements per block (a power of 2) */

Moral - if you really want to optimize this for your special purpose, be prepared to play with <deque>. Its behaviour depends critically on the size of your elements, and beyond that on the expected usage pattern.

EDIT: Depending on your knowledge of queue sizes, you might be able to drop in boost::circular_buffer as a replacement for the std::queue container. I bet this would perform more like you want (and expected).
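
A sketch of what that drop-in might look like, assuming the queue length is known and fixed (the capacity below is a placeholder value). One caveat worth a comment: a full boost::circular_buffer silently overwrites its oldest element on push_back, so the queue must never be allowed to outgrow its capacity:

#include <queue>
#include <boost/circular_buffer.hpp>

int main()
{
    // boost::circular_buffer provides front(), back(), push_back() and
    // pop_front(), so it satisfies std::queue's container requirements.
    // One contiguous allocation up front - no per-block bookkeeping at all.
    boost::circular_buffer<double> storage(4096);   // capacity: example value
    std::queue<double, boost::circular_buffer<double> > q(storage);

    for (int i = 0; i < 4096; ++i)
        q.push(i * 1.0);

    // Caution: pushing into a full circular_buffer drops the oldest element
    // instead of growing, so never push past the chosen capacity.
    while (!q.empty())
        q.pop();

    return 0;
}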

Upvotes: 3

Ben Jackson

Reputation: 93930

Without looking at the actual implementation of std::queue you're using, my guess is that its memory allocation looks something like this:

if (new element won't fit) {
    double the size of the backing storage
    realloc the buffer (which will probably copy all elements)
}

The reason for doubling rather than growing more conservatively is that you want the queue.push_back operation to take O(1) amortized time. Since each reallocation may copy all of the existing elements, a version that grew the array only as needed (one element at a time) would be O(n^2) as you initially push all of your values into the queue. I'll leave it as an exercise for the reader to show why the doubling version gives constant amortized time.
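
To make the amortization argument concrete, here is a tiny sketch that only counts element copies under the two growth policies described above; it deliberately does not model what std::deque itself does:

#include <iostream>

// Copies performed while pushing n elements into a reallocating buffer
// that doubles its capacity whenever it fills up.
unsigned long long copies_doubling(unsigned long long n)
{
    unsigned long long copies = 0, capacity = 1;
    for (unsigned long long size = 0; size < n; ++size)
    {
        if (size == capacity)   // full: double and copy the existing elements
        {
            copies += size;
            capacity *= 2;
        }
    }
    return copies;              // < 2n copies in total => O(1) amortized push
}

// Copies performed when the buffer grows by exactly one element per push.
unsigned long long copies_grow_by_one(unsigned long long n)
{
    unsigned long long copies = 0;
    for (unsigned long long size = 0; size < n; ++size)
        copies += size;         // every push reallocates and copies
    return copies;              // ~n^2/2 copies in total => O(n) amortized push
}

int main()
{
    const unsigned long long n = 100000000ULL;   // 100M, as in the question
    std::cout << "doubling:    " << copies_doubling(n) << " copies\n";
    std::cout << "grow-by-one: " << copies_grow_by_one(n) << " copies\n";
    return 0;
}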

Since you are quoting the size of the entire process, your estimate of about 2x overhead when you push slightly more than a power of 2 (2^26 < 100M < 2^27) worth of elements seems reasonable. Try stopping at 2^(n-1), measuring, then pushing a few more elements and measuring again.

Upvotes: -2

Dialecticus

Reputation: 16779

Look at the code for _DEQUESIZ (number of elements per block):

#define _DEQUESIZ   (sizeof (_Ty) <= 1 ? 16 \
    : sizeof (_Ty) <= 2 ? 8 \
    : sizeof (_Ty) <= 4 ? 4 \
    : sizeof (_Ty) <= 8 ? 2 : 1)    /* elements per block (a power of 2) */

It gets smaller as the element gets larger. Only for elements larger than 8 bytes will you get the expected behavior: overhead that shrinks, as a percentage, as the element size increases.
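
To see how this plays out against the numbers in the question, here is a back-of-the-envelope sketch. The 8-byte map pointer per block matches a 64-bit build, but the 16 bytes of heap-manager bookkeeping per block is purely an assumption about the 64-bit Windows heap; with those assumptions the estimates land within a few percent of the measured 252,216K and 1,976,676K:

#include <iostream>

// Rough footprint estimate for a block-based deque: every block holds
// elems_per_block elements, costs one heap allocation, and needs one pointer
// in the deque's block map. heap_overhead_per_block is an ASSUMPTION
// (per-allocation bookkeeping plus alignment in the 64-bit Windows heap).
unsigned long long estimate_kb(unsigned long long n_elems,
                               unsigned long long elem_size,
                               unsigned long long elems_per_block,
                               unsigned long long heap_overhead_per_block = 16,
                               unsigned long long map_ptr_per_block = 8)
{
    unsigned long long blocks = (n_elems + elems_per_block - 1) / elems_per_block;
    unsigned long long bytes  = blocks * (elems_per_block * elem_size
                                          + heap_overhead_per_block
                                          + map_ptr_per_block);
    return bytes / 1024;
}

int main()
{
    // char: 16 per block  -> 6.25M blocks   (measured: 252,216 K)
    std::cout << "char estimate:   " << estimate_kb(100000000ULL, 1, 16) << " K\n";
    // double: 2 per block -> 50M blocks     (measured: 1,976,676 K)
    std::cout << "double estimate: " << estimate_kb(100000000ULL, 8, 2)  << " K\n";
    return 0;
}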

Upvotes: 15
