ashisahu
ashisahu

Reputation: 397

Which is a better implementation to implement a trie node's children - array or hashmap?

I am reading about trie data structure and found two implementations to implement the children in trie node. Following are the details of the two implementations :-

1) Trie node array of length 26 has been used to store children of the trie node.

2) HashMap has been used to store children of the trie node with character as the key and Trie node as the value.

Please let me know which implementation is better and why?

Upvotes: 16

Views: 4709

Answers (3)

leo leo
leo leo

Reputation: 21

array is classic textbook implementation, in default choice.

hashmap cost less memory, when alphabetes are large and the actual number of keys in use is relative small.but hashmap itself structure cost more memory than array. so it's trade off and depends on the actual trie's keys.

access speed per child link is almost same O(1).

Upvotes: 2

Jim Mischel
Jim Mischel

Reputation: 134005

There are two very common structures used for trie nodes:

CharNode
    char letter
    CharNode[26] children

CharNode
    char letter
    Dictionary<char, CharNode> children

Those work well, but they waste a huge amount of memory because the list of children is remarkably sparse. In my opinion, neither gives a performance benefit that offsets the memory cost. I prefer to use:

CharNode
    char letter
    CharNode[] children

or

CharNode
    char letter
    CharNode* firstChild
    CharNode* sibling

In the first case, the children array is variably sized to hold just the number of children actually used, and the children are arranged with the most frequently used letter first. Sequential search finds the required child.

In the second case, you have a linked list of children, and each child has a sibling pointer. Again, children are arranged in the list based on frequency.

I prefer the second, because in many runtime environments the cost of allocating an array is pretty high. In .NET, for example, array allocation overhead is on the order of 50 bytes. Considering that a trie node often has fewer than five children, the array allocation overhead is larger than the data that the array holds. With the linked list arrangement, you don't waste any memory.

Sequential search of the small child list is very fast, because the list of children to search is usually very short and the distribution of letter frequencies is usually very skewed. That is, the first two children usually are used far more frequently than the rest. So on average you'll only have to search two or three child nodes.

Either of these saves a huge amount of memory, which can make for a faster program. My tests haven't shown an appreciable performance hit in going with these alternate structures.

Upvotes: 4

dreamzor
dreamzor

Reputation: 5925

It depends - a usual trade-off between memory and speed.

If your strings are short and you don't have memory issues then of course go for the array. This way you make your searches faster. Also good if your letters are evenly distributed among words.

If your strings can be large and there are some letters that occur very rarely then go for the hash map. This way you don't occupy too much unused memory. That's also better if your alphabet is much larger than 26 letters.

Array is faster but potentially consumes more memory than HashMap - but not necessary. Imagine that your bag of words contain all the possible 26^N words of length N that can be made of 26 letters. Then HashMap would be both slower and consume more memory.

Upvotes: 6

Related Questions