Nick

Reputation: 155

Why does std::set container use much more memory than the size of its data?

For example, suppose we have 10^7 32-bit integers. The memory needed to store these integers in an array is 32*10^7/8 = 40MB. However, inserting the same 10^7 integers into a std::set takes more than 300MB of memory. Code:

#include <iostream>
#include <set>

int main(int argc, const char * argv[]) {
    std::set<int> aa;
    // insert 10^7 distinct ints; watch the process's memory usage grow
    for (int i = 0; i < 10000000; i++)
        aa.insert(i);
    return 0;
}

Other containers like map and unordered_set take even more memory in similar tests. I know that set is implemented as a red-black tree, but the data structure itself does not explain the high memory usage.

I am wondering what explains this 5-8x memory usage compared to the original data, and whether there is a workaround or alternative for a more memory-efficient set.

Upvotes: 1

Views: 2018

Answers (1)

Michael Veksler

Reputation: 8475

Let's examine the std::set implementation in GCC (which is not much different in other compilers). std::set is implemented as a red-black tree in GCC. Each node has a pointer to the parent, left, and right nodes, and a color enumerator (_S_red or _S_black). This means that besides the int (which is probably 4 bytes) there are 3 pointers (3 * 8 = 24 bytes on a 64-bit system) and one enumerator (since it comes before the pointers in _Rb_tree_node_base, it is padded to an 8-byte boundary, so it effectively takes an extra 8 bytes).

So far I have counted 24 + 8 + 4 = 36 bytes for each integer in the set. But since the node has to be aligned to 8 bytes, its size must be padded up to a multiple of 8, which means each node takes 40 bytes (10 times larger than the int itself).
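To make the arithmetic concrete, here is a minimal sketch of a struct with the same layout (the field names are illustrative, not the actual libstdc++ identifiers):

#include <iostream>

// Mock of a red-black tree node for a set of int: field order mirrors
// _Rb_tree_node_base (color first, then three pointers), followed by the value.
struct MockNode {
    int       color;   // 4 bytes, padded to 8 so the pointers stay aligned
    MockNode *parent;
    MockNode *left;
    MockNode *right;
    int       value;   // the stored int, followed by tail padding
};

int main() {
    std::cout << sizeof(MockNode) << "\n"; // prints 40 on a typical 64-bit ABI
}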

But this is not all. Each such node is allocated by std::allocator, which uses new to allocate each node. Since delete can't know how much memory to free, each node also carries some heap-related meta-data. That meta-data must at least record the size of the allocated block, which usually takes 8 bytes (in theory it is possible to use some kind of Huffman coding and store only 1 byte in most cases, but I have never seen anybody do that).
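You can observe the per-node allocations directly by overriding the global operator new and counting. A rough sketch (it counts the bytes requested per node, so the heap's own meta-data comes on top of this):

#include <cstdlib>
#include <iostream>
#include <new>
#include <set>

static std::size_t g_allocs = 0; // number of heap allocations
static std::size_t g_bytes  = 0; // bytes requested from the heap

void* operator new(std::size_t n) {
    ++g_allocs;
    g_bytes += n;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
    std::set<int> s;
    for (int i = 0; i < 1000000; i++)
        s.insert(i);
    // Expect roughly one allocation per element and ~40 requested bytes each
    // on 64-bit libstdc++ (a few extra allocations come from the iostreams).
    std::cout << g_allocs << " allocations, "
              << g_bytes / s.size() << " bytes per element\n";
}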

Considering everything, the total for each int node is 48 bytes: 12 times more than the int itself. Every int in the set takes 12 times the space it would have taken in an array or a vector.

Your numbers suggest that you are on a 32-bit system, since your data takes only 300 MB. On a 32-bit system pointers take 4 bytes, which makes 3 * 4 + 4 = 16 bytes of red-black-tree data per node, plus 4 for the int and 4 for heap meta-data. That totals 24 bytes for each int instead of 4, i.e. 6 times larger than a vector for a big set. Your measurement actually suggests that the heap meta-data takes 8 bytes rather than 4 (maybe due to an alignment constraint), bringing the total to 28 bytes per int.

So on your system, instead of the 40MB it would take in a std::vector, the set is expected to take about 10^7 * 28 = 280MB.

If you want to save some peanuts, you can use a non-standard allocator for your sets. You can avoid the per-node meta-data overhead by using Boost's segregated-storage node allocators. That is not such a big win in terms of memory, but it could boost your performance, since those allocators are simpler than the general-purpose code behind new and delete.
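For example, with Boost.Pool (header-only) a segregated-storage allocator can be plugged straight into the set; a sketch, assuming Boost is installed:

#include <boost/pool/pool_alloc.hpp>
#include <iostream>
#include <set>

int main() {
    // Nodes now come from a pool that carves them out of large blocks,
    // so there is no per-node heap header; each node itself is still ~40 bytes.
    std::set<int, std::less<int>, boost::fast_pool_allocator<int>> aa;
    for (int i = 0; i < 10000000; i++)
        aa.insert(i);
    std::cout << aa.size() << "\n";
    return 0;
}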

Upvotes: 7
