Vivek Sharma
Vivek Sharma

Reputation: 3814

Algorithms and Data Structures best suited for a spell checker, dictionary and a thesaurus

Best way to implement a

When asked in a one hour interview, are we expected to write a c/c++ code, for the algorithm?

Upvotes: 11

Views: 10282

Answers (6)

Tobias Langner
Tobias Langner

Reputation: 10828

I can see no better data structure than a trie for the dictionary and the thesaurus. Both can be fitted in one structure if needed with one link in the node pointing to the meaning of the word (dictionary) and one to synonyms (thesaurus). It can even offer some form of autocompletion (when there's only one link in the node).

Spelling corrector is a bit trickier - since one has to map fals input to some kind of correct input. You can take this link as a start: http://en.wikipedia.org/wiki/Spell_checker. At the end you'll find links to papers about different algorithms. According to the wikipedia article, this paper describes the most successfull algorithm: Andrew Golding and Dan Roth's "Winnow-based spelling correction algorithm"

Upvotes: 2

Grynszpan
Grynszpan

Reputation: 147

For the dictionary, there is indeed a data structure superior to the trie. Try a DAWG, or CDAWG: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph. Just to complicate matters, my favorite paper on the structure, Ciura and Deorowicz's "How to Squeeze a Lexicon" calls them "minimal ADFAs". Google around and you'll find many competing algorithms for constructing these beasts. Good luck!

Upvotes: 4

Nick Johnson
Nick Johnson

Reputation: 101149

In all three cases, you can construct a BK-Tree from your word set. BK-Trees let you find all words within a given edit distance of the entered word. See my blog post on BK-Trees for an explanation of how they work.

A dictionary and spell checker are more or less identical - the dictionary simply needs to provide definitions along with the words. For a thesaurus, words are clustered into 'synsets' - synonym sets - with all the elements inserted into the BK-Tree. When someone looks up one word in the synset, you display all the other ones as alternatives. A word can appear in multiple synsets, so you need to ensure your BK-Tree nodes can handle multiple values for a given key.

Upvotes: 1

Patrice Bernassola
Patrice Bernassola

Reputation: 14446

For a dictionary I would use the std::map (calling Dictionary in the .Net framework) collection with the word as key and a custom object (with all information about the word + the definition) as value.

For a thesaurus the best structure is a tree where each node is a section and where each branch finished with an object which contains all information about what you have to display.

Upvotes: 1

Anonymous
Anonymous

Reputation: 18631

Also check out the Bloom filter.

Upvotes: 1

Anton Gogolev
Anton Gogolev

Reputation: 115897

See this for a 21-line Python 2.5 spelling corrector and a bit of background.

Upvotes: 6

Related Questions