Christian Andersson
Christian Andersson

Reputation: 23

How to efficiently add the entire English dictionary to a trie data structure

Simply put I want to check if a specified word exists or not.

The lookup needs to be very fast which is why I decided to store the dictionary in a trie. So far so good! My trie works without issues. The problem is filling the trie with a dictionary. What I'm currently doing is looping through every line of a plain text file that is the dictionary and adding each word to my trie.

This is understandably so an extremely slow process. The file contains just about 120 000 lines. If anyone could point me in the right direction for what I could do it would be much appreciated!

This is how I add words to the trie (in Boo):

trie = Trie()

saol = Resources.Load("saol") as TextAsset
text = saol.text.Split(char('\n'))

for new_word in text:
    trie.Add(new_word)

And this is my trie (in C#):

using System.Collections.Generic;

public class TrieNode {
    public char letter;
    public bool word;
    public Dictionary<char, TrieNode> child;

    public TrieNode(char letter) {
        this.letter = letter;
        this.word = false;
        this.child = new Dictionary<char, TrieNode>();
    }
}

public class Trie {
    private TrieNode root;

    public Trie() {
        root = new TrieNode(' ');
    }

    public void Add(string word) {
        TrieNode node = root;
        bool found_letter;

        int c = 1;
        foreach (char letter in word) {
            found_letter = false;

            // if current letter is in child list, set current node and break loop
            foreach (var child in node.child) {
                if (letter == child.Key) {
                    node = child.Value;
                    found_letter = true;
                    break;
                }
            }

            // if current letter is not in child list, add child node and set it as current node
            if (!found_letter) {
                TrieNode new_node = new TrieNode(letter);
                if (c == word.Length) new_node.word = true;
                node.child.Add(letter, new_node);
                node = node.child[letter];
            }

            c ++;
        }
    }

    public bool Find(string word) {
        TrieNode node = root;
        bool found_letter;

        int c = 1;
        foreach (char letter in word) {
            found_letter = false;

            // check if current letter is in child list
            foreach (var child in node.child) {
                if (letter == child.Key) {
                    node = child.Value;
                    found_letter = true;
                    break;
                }
            }

            if (found_letter && node.word && c == word.Length) return true;
            else if (!found_letter) return false;

            c ++;
        }

        return false;
    }
}

Upvotes: 2

Views: 2560

Answers (2)

MarZab
MarZab

Reputation: 2623

Anything you do with CLI yourself will be slower then using the built-in functions. 120k is not that much for a dictionary.

First thing I would do is fire up the code performance tool.

But just some wild guesses: You have a lot of function calls. Just starting with the Boo C# binding in a for loop. Try to pass the whole text block and tare it apart with C#.

Second, do not use a Dictionary. You waste just about as much resources with your code now as you would just using a Dictionary.

Third, sort the text before you go inserting - you can probably make some optimizations that way. Maybe just construct a suffix table.

Upvotes: 0

John Percival Hackworth
John Percival Hackworth

Reputation: 11531

Assuming that you don't have any serious implementation problems, pay the price for populating the trie. After you've populated the trie serialize it to a file. For future needs, just load the serialized version. That should be faster that reconstructing the trie.

-- ADDED --

Looking closely at your TrieNode class, you may want to replacing the Dictionary you used for child with an array. You may consume more space, but have a faster lookup time.

Upvotes: 4

Related Questions