Reputation: 397
I am reading about trie data structure and found two implementations to implement the children in trie node. Following are the details of the two implementations :-
1) Trie node array of length 26 has been used to store children of the trie node.
2) HashMap has been used to store children of the trie node with character as the key and Trie node as the value.
Please let me know which implementation is better and why?
Upvotes: 16
Views: 4709
Reputation: 21
array is classic textbook implementation, in default choice.
hashmap cost less memory, when alphabetes are large and the actual number of keys in use is relative small.but hashmap itself structure cost more memory than array. so it's trade off and depends on the actual trie's keys.
access speed per child link is almost same O(1).
Upvotes: 2
Reputation: 134005
There are two very common structures used for trie nodes:
CharNode
char letter
CharNode[26] children
CharNode
char letter
Dictionary<char, CharNode> children
Those work well, but they waste a huge amount of memory because the list of children is remarkably sparse. In my opinion, neither gives a performance benefit that offsets the memory cost. I prefer to use:
CharNode
char letter
CharNode[] children
or
CharNode
char letter
CharNode* firstChild
CharNode* sibling
In the first case, the children
array is variably sized to hold just the number of children actually used, and the children are arranged with the most frequently used letter first. Sequential search finds the required child.
In the second case, you have a linked list of children, and each child has a sibling pointer. Again, children are arranged in the list based on frequency.
I prefer the second, because in many runtime environments the cost of allocating an array is pretty high. In .NET, for example, array allocation overhead is on the order of 50 bytes. Considering that a trie node often has fewer than five children, the array allocation overhead is larger than the data that the array holds. With the linked list arrangement, you don't waste any memory.
Sequential search of the small child list is very fast, because the list of children to search is usually very short and the distribution of letter frequencies is usually very skewed. That is, the first two children usually are used far more frequently than the rest. So on average you'll only have to search two or three child nodes.
Either of these saves a huge amount of memory, which can make for a faster program. My tests haven't shown an appreciable performance hit in going with these alternate structures.
Upvotes: 4
Reputation: 5925
It depends - a usual trade-off between memory and speed.
If your strings are short and you don't have memory issues then of course go for the array. This way you make your searches faster. Also good if your letters are evenly distributed among words.
If your strings can be large and there are some letters that occur very rarely then go for the hash map. This way you don't occupy too much unused memory. That's also better if your alphabet is much larger than 26 letters.
Array is faster but potentially consumes more memory than HashMap - but not necessary. Imagine that your bag of words contain all the possible 26^N words of length N that can be made of 26 letters. Then HashMap would be both slower and consume more memory.
Upvotes: 6