Reputation: 16081
Which is a more efficient. A Trie structure like this:
struct TrieNode
{
char letter;
bool isWord;
TrieNode* subNodes[26];
};
or a Trie structure like this:
struct TrieNode
{
char letter;
bool isword;
map<int, TrieNode*> subNodes;
};
Or is there just a better implementation altogether... Also, could someone give me an explanation?
Upvotes: 1
Views: 1641
Reputation: 40649
I would use the first, for simplicity and speed, but conceivably the second could save space.
You don't need the char letter
element in either code.
It's redundant, because the way you look up a word is to take a letter of the key and use it either as an index into the subNode array, or a key into your map, so as to choose a subNode.
Either way, you never need to look at letter
.
The way you know if the word is not in the trie is if you hit a null subnode, or if you exhaust your key without hitting a isWord
subnode.
BTW, if your trie doesn't contain too many words, and if it does not change very often, you will always save about an order of magnitude of speed by transforming it into ad-hoc code.
EDIT What I mean by ad-hoc code is that a trie is a kind of finite-state-machine, and a finite-state-machine is a kind of program. So you write a program to read the sorted dictionary, but instead of building a trie data structure, it writes out a program in your favorite language that looks like this:
// XYZ is the prefix string that corresponds to a node in the trie
bool XYZFunc(char* key){
switch (*key){
case '\0': return true /* if XYZ is a valid word, else false */; break;
case 'a': return XYZaFunc(key+1); break;
case 'b': return XYZbFunc(key+1); break;
// etc. etc.
}
}
This could be a lot of functions, but within reason the compiler should be able to handle it. Then to look up a word you just call the top-level function, and it returns true or false. At each node, the compiler will determine if it needs a jump table or not, so you don't have to worry about that.
Upvotes: 1
Reputation: 11
I used to take the first approach (ie each node has a child for every possible letter of the alphabet), however realised that's hideously inefficient (space wise) and assumes you always have a constant algorithm.
If you instead replace the array with a linked list (and then manipulate it some) you can get to a Binary Tree implementation (however the structure would still be more efficient than a traditional binary tree to lookup, because you're not using string comparisons at each node, and because your keyspace overlaps (finding "the" and finding "then" start with the same comparisons).
ie consider:
struct TrieNode
{
char key;
char *val; /* This is null unless we are an "end node" - you could use the Bool as you do, but I've found this a bit simpler */
struct TrieNode *siblings; /* traversing this is checking different characters at this position in the string */
struct TrieNode *children; /* Travesring this list is looking at subsequent positions in the list */
};
Although in the worst case, the efficiency of this approach starts to blow out, the size of the alphabet determines the maximum number of siblings to check, and for sorting natural language (as opposed to genomes) the trie will usually be quite sparse so we'll never come close to the actual worst case.
Upvotes: 1