David
David

Reputation:

Using a Binary Search Tree as a spell checker

Wondering the most efficent way to make a binary search tree into a spell checker by reading in say 1000 word dictionary file and then having it check another document that say has a couple paragraphs.

Upvotes: 4

Views: 8519

Answers (8)

Adrian
Adrian

Reputation: 5681

As suggested a trie would be more efficient than a binary tree, but you can use a hashmap and hash each word. You have a small dictionary (1000 entries). As you traverse your document, check if the words are in the hashmap. If they're not, the word is assumed to be misspelled.

This won't give you possible correction to a misspelled word. It just tells you yes or no (correct or not).

If you want spelling suggestions for incorrect words you can start from the word in the file, then generate all words 1 edit distance away and add these as children of the initial word. This way you are building a graph. Go 2 levels deep for maximum speed vs accuracy. If you generate a word node that is in the dictionary, you can add it to a list of possible suggestions. At the end, return the list of possible suggestions.

For better spell checking, also try to add in phonetic matching.

sea yuh -> see yah

This method (of creating graphs of strings 1 edit away) is "slow". But it is a good academic exercise. Runtime is O(n^branches).

If interested here is a link to one I built myself (for fun): https://github.com/eamocanu/spellcheck.graph

Some sample graphs: https://github.com/eamocanu/spellcheck.graph/tree/master/graph%20photos

I also added a UI component to it which generates the graphs. This is an external library.

Upvotes: 0

purna
purna

Reputation: 11

This site should help you it has the implementation in java .

Upvotes: 1

Andre Artus
Andre Artus

Reputation: 1890

Seeing that this is a homework question I'm going to assume that you have to use a plain old binary tree (no Red-Black trees, AVL trees, Radix trees, etc.). The answer then is to try to keep the tree balanced as you build it from the word list. One approach is to randomize the list prior to reading it in, this gives reasonable results. But you can get better results if you order the input sequence (using the same comparison as what the tree uses), then recursively subdivide the input returning the midpoint until no elements remain. The result is a balanced tree.

I knocked up three different ways of doing it in C#:

private static IEnumerable<T> BinaryTreeOrder<T>(IList<T> range, int first, int last)
{
  if (first > last)
  {
    yield break;
  }

  int mid = (first + last) / 2;
  yield return range[mid];
  foreach (var item in BinaryTreeOrder(range, first, mid - 1))
  {
    yield return item;
  }
  foreach (var item in BinaryTreeOrder(range, mid + 1, last))
  {
    yield return item;
  }    
}

private static void BinaryTreeOrder<T>(IList<T> range, int first, int last, 
                                       ref IList<T> outList)
{
  if (first > last)
  {
    return;
  }

  int mid = (first + last) / 2;
  outList.Add(range[mid]);
  BinaryTreeOrder(range, first, mid - 1, ref outList);
  BinaryTreeOrder(range, mid + 1, last, ref outList);
}

private static void BinaryTreeOrder<T>(IList<T> range, int first, int last, 
                                       ref BinaryTree<T> tree) where T : IComparable<T>
{
  if (first > last)
  {
    return;
  }

  int mid = (first + last) / 2;
  tree.Add(range[mid]);
  BinaryTreeOrder(range, first, mid - 1, ref tree);
  BinaryTreeOrder(range, mid + 1, last, ref tree);
}

Upvotes: 0

mipadi
mipadi

Reputation: 410672

Are you dead-set on using a binary search tree? A Bloom filter would probably be a more efficient data structure.

Upvotes: 3

nickf
nickf

Reputation: 546055

If you're just trying to see if a particular word exists in your dictionary (that is, it's spelt correctly), then I don't think a binary search tree is what you're after. A better way to store that information would be in a tree style where each successive node on your tree is one character, and reading the path to the end node gives you the spelling of that word. You'd also need to add a marker to indicate a word-ending.

Eg: say your dictionary has these words: car, cart, cat, cup, cut

- C
  - A
    - R
      - end
      - T
    - T
      - end
  - U
    - P
      - end
    - T
      - end

Checking if a word exists is a matter of looking at each letter individually, and that it exists in the children of the current node.

Check for "cat"
Does "C" exist at the root level? Yes, move to the next letter.
Does "A" exist underneath C? Yes, move on.
Does "T" exist underneath A? Yes, move on.
Is there a word ending after the T? Yes. Word exists.

Check for "cu"
Does "C" exist at the root level? Yes, move to the next letter.
Does "U" exist at the root level? Yes, move to the next letter.
Is there a word ending after the U? No. Word does not exist.

How you store this information is up to you. As Steven pointed out, a Ternary Search Trie might be the way to go: each node would have 27 possible child nodes.

Upvotes: 1

Steve Jessop
Steve Jessop

Reputation: 279255

With the example you gave, performance is likely to be irrelevant, since on a PC the whole operation will take approximately 1% of the time it takes the user to read the first result you show, provided you don't use a completely stupid algorithm. But still, I'll assume the problem is big enough that performance is an issue.

If the dictionary file is presorted (as most are), and if the text is small relative to the dictionary as you describe, then I would be sorely tempted to sort the text, perhaps removing duplicates, and then iterate through both lists side-by-side using the same procedure as a merge sort, except you report whether each text word is in the dictionary instead of outputting a merged list.

This does the job in about M log M comparisons for the sort, plus at most N + M comparisons for the iteration, (possibly less, but not complexity-less). That's fairly close to optimal complexity for a one-off operation: to get rid of the linear term in N you need to find ways to not read the whole dictionary from disk at all. I'm pretty sure it's possible to bsearch into the file, especially given that words are quite short, but for small N it's anyone's guess whether seeking about the place will actually be faster than serially accessing the data.

It has the following characteristics:

  • You don't have to hold the dictionary in memory, only the text.
  • Nevertheless, you only make one pass over the dictionary file.
  • You don't do any expensive processing of the dictionary.

Of course if the dictionary file isn't pre-sorted then this doesn't work, and if you can keep the dictionary hanging around in memory for the next spellcheck operation then you can amortise the cost of I/O and of processing it into a tree across several different texts, which will be a win in the long run.

If the dictionary is really huge, then you might benefit from storing it on disk in a pre-processed form equivalent to an unbalanced tree weighted according to the relative frequencies of the various words in your language. Then you can do less than O(N) disk access for small texts, and on most OSs not bother loading it into memory at all, just mmap the file and let the OS worry about it. For a large dictionary, the entire clusters containing words beginning with "dimethyl" need never be touched.

Another consideration is a splay tree for the dictionary. A splay tree unbalances itself as you look things up in it, in order to make frequently-used values quicker to find. Most text uses a small number of words repeatedly, so if the text is long enough to justify the overhead this will win eventually.

Both of the above are subject to Steven A Lowe's point that for strings, a trie beats a normal tree. Don't know whether you'll find an off-the-shelf splay trie, though.

Upvotes: 0

seanb
seanb

Reputation: 6954

If you need to do an auto suggest/prefix search as well, then a patricia tree or radix tree is worth looking at.

Upvotes: 0

Steven A. Lowe
Steven A. Lowe

Reputation: 61233

a ternary tree trie would be more efficient

Upvotes: 8

Related Questions