Reputation: 733
I was working on a project that needed fast string search over a large collection of string values. I decided to use a trie (a ternary search trie, as the code below shows) for the search, and this approach was fast. This is the part of that project that's relevant to my question:
#include <cstring> // for strlen()

class TTrieNode{
public:
    char c;                        // character stored at this node
    bool data;                     // true if a key ends at this node
    TTrieNode *left, *mid, *right; // ternary-search-trie links
    TTrieNode(){
        left = mid = right = NULL;
        c = data = 0;
    }
};

class TTrie{
private:
    TTrieNode *root;
    TTrieNode *insert(TTrieNode *n, char *s, int idx){
        char ch = s[idx];
        if(!n){
            n = new TTrieNode();
            n->c = ch;
        }
        if(ch < n->c){
            n->left = insert(n->left, s, idx);
        }else if(ch > n->c){
            n->right = insert(n->right, s, idx);
        }else if(idx + 1 < (int)strlen(s)) // more characters left in the key
            n->mid = insert(n->mid, s, idx + 1);
        else
            n->data = true; // full key consumed: mark end of string
        return n;
    }
public:
    TTrie(){
        root = NULL;
    }
    void insert(char *s){
        root = insert(root, s, 0);
    }
};
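For context, a hypothetical usage building the trie from a collection of words (the real search code isn't part of the question):

#include <string>
#include <vector>

int main(){
    std::vector<std::string> words = {"cat", "car", "cart"}; // sample data
    TTrie trie;
    for(std::string &w : words)
        trie.insert(&w[0]); // insert() takes a mutable char*
    return 0;
}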
Everything was good until we tested my trie on the real data. Based on my calculation of the number of nodes and the amount of space each node takes, it should have taken ~40 GB of RAM, but to my surprise it took ~70 GB. At first I thought this was because of memory alignment inside each node (just a raw guess!), so I used __attribute__((packed, aligned(1))) with my TTrieNode definition!
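For reference, this is roughly what the packed definition looked like (the exact attribute placement is my reconstruction, not the code from the project):

class __attribute__((packed, aligned(1))) TTrieNode{
public:
    char c;
    bool data;
    TTrieNode *left, *mid, *right;
    // ... constructor as before ...
};
// With GCC on x86_64 this shrinks sizeof(TTrieNode) from 32 to 26 bytes,
// since the two 1-byte fields are no longer padded out to pointer alignment.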
Using this didn't make much of a difference. After a lot of tests I switched to manual memory allocation: instead of calling new each time I wanted to allocate a new node, I allocated ~50 GB of RAM at the beginning of my program and used the following custom new function instead:
#include <cstdlib> // for malloc()

// MAXNODES is an upper bound on the number of nodes, defined elsewhere.
TTrieNode *memLoc;
int memIdx;

void initMemory(){
    // One big allocation up front: only this single chunk pays the
    // allocator's per-allocation overhead.
    memLoc = (TTrieNode*) malloc(MAXNODES * sizeof(TTrieNode));
    memIdx = 0;
}

TTrieNode *myNew(){
    // Hand out the next slot from the pool, zero-initialized the same way
    // the constructor would. (No bounds check: memIdx is assumed to stay
    // below MAXNODES.)
    memLoc[memIdx].left = memLoc[memIdx].right = memLoc[memIdx].mid = NULL;
    memLoc[memIdx].c = memLoc[memIdx].data = 0;
    return &memLoc[memIdx++];
}
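With this pool in place, the only change to the trie code is the allocation line in insert() (a sketch; I'm assuming the rest of the project stayed the same):

if(!n){
    n = myNew(); // was: n = new TTrieNode();
    n->c = ch;
}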
This was very surprising, but this time the program took EXACTLY the amount of memory I was expecting!
Now my questions are these:
Why does each new (malloc) allocation take some extra memory? Is there some kind of bookkeeping pointer at the kernel/user level for each memory allocation? I haven't tested my code on Windows (or any other operating system), but I would like to know if there is similar behavior on those operating systems as well.
Upvotes: 1
Views: 167
Reputation: 5321
There is an 8- to 16-byte overhead per chunk allocated. In a typical x86_64 allocator, an 8-byte overhead is needed in order to correctly organize the memory chunks when they get freed. There is also a 16-byte alignment requirement, so a request whose size is already a multiple of 16 bytes takes the basic 8-byte overhead and then wastes another 8 bytes to get back onto a 16-byte boundary.
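You can see this directly on glibc with malloc_usable_size (a glibc extension; the numbers below assume glibc on x86_64 and will differ with other allocators):

#include <cstdio>
#include <cstdlib>
#include <malloc.h> // malloc_usable_size(), a glibc extension

int main(){
    // 32 bytes is the likely padded sizeof(TTrieNode) from the question.
    void *p = malloc(32);
    // Typically prints 40: the chunk is 48 bytes (user bytes plus the
    // 8-byte header, rounded up to a multiple of 16), of which 40 are
    // usable by the caller.
    printf("usable size: %zu\n", malloc_usable_size(p));
    free(p);
    return 0;
}

At 48 bytes of chunk per 32-byte node, an expected ~40 GB of nodes becomes ~60 GB before any fragmentation or other heap bookkeeping is counted, which is in the ballpark of the ~70 GB observed.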
Typical 64-bit design: each chunk is preceded by an 8-byte control word. Most of the control word is needed to record the size of that chunk, so it can be freed. The bottom few bits are available for other purposes because the size is divisible by 16. The most important of those purposes is knowing whether the preceding chunk is free. When this chunk is freed, if the preceding one was already free, the two get consolidated. The freed chunk also gets consolidated, if possible, with the next chunk, but being able to do that doesn't take extra info.
The minimal information needed is surprising (and elegant), especially the fact that each chunk header must include a bit saying whether the previous chunk is free, but doesn't need a bit saying whether the current chunk is free. For consolidation, you can always find the next chunk, since you know the size of this one. But with minimal information you can't find the previous chunk unless you already know it is free; conveniently, you don't need to find it unless it is free. So at the end of every free chunk there is a pointer (or, equivalently, a size) leading back to its beginning, which lets you navigate to a free chunk from its successor. If the chunk is not free, those bytes are part of the user data, not overhead. You could also find out whether the successor is free by going to the successor's successor and checking whether its predecessor is free; that is more elegant than using one more of the spare bits, but not necessarily better.
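A rough sketch of that boundary-tag layout in C++ (the names and the exact flag layout are mine; real allocators such as glibc's ptmalloc differ in detail):

#include <cstddef> // size_t, NULL

// Control word that precedes every chunk. Chunk sizes are multiples of
// 16, so the low 4 bits of the size field are free to hold flags.
struct ChunkHeader {
    size_t size_and_flags; // size | PREV_IN_USE in bit 0
};

const size_t PREV_IN_USE = 1;

size_t chunkSize(const ChunkHeader *h){ return h->size_and_flags & ~(size_t)15; }
bool prevInUse(const ChunkHeader *h){ return h->size_and_flags & PREV_IN_USE; }

// The next chunk's header sits chunkSize() bytes past this one, so the
// forward walk needs no extra information.
ChunkHeader *nextChunk(ChunkHeader *h){
    return (ChunkHeader*)((char*)h + chunkSize(h));
}

// A free chunk stores its size again in its last 8 bytes (the boundary
// tag), i.e. directly before its successor's header. That makes the
// backward walk possible exactly when it is needed: when prev is free.
ChunkHeader *prevChunkIfFree(ChunkHeader *h){
    if(prevInUse(h)) return NULL;
    size_t prevSize = *((size_t*)h - 1);
    return (ChunkHeader*)((char*)h - prevSize);
}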
Upvotes: 2