Reputation: 14738
I am working on implementing a spell check function for a web-based WYSIWYG editor. I am currently using the Damerau-Levenshtein distance algorithm to produce a list of spelling suggestions. This is all working out nicely, but I am curious as to how I might improve the functionality.
Specifically, my implementation does not currently handle conjoined words. For instance, I would like to be able to detect "areyou" and suggest "are you" instead. I think I can do this by breaking the potentially conjoined word apart at likely looking segments and testing both halves. Since all English words must have at least one vowel, I think I can look for vowels to help me decide where to break words apart.
The Damerau-Levenshtein distance algorithm was so useful; it is clear that others have put a lot more thought into this than I have. Is there a similarly clever algorithm that I should consider for detecting conjoined words, or am I on the right track already?
Upvotes: 4
Views: 911
Reputation: 101139
Check out this excellent article on writing a spelling checker. Using that technique, you have two options: Either include every pair of words, or every likely pair of words in the dictionary (with the separated words as the solution), or try every possible split point and do a standard dictionary lookup to see if both words are valid.
Upvotes: 1
Reputation: 2539
As you are already reading the whole dictionary for every word, it wouldn't be terribly inefficient to append common pairs of words to the dictionary. Alternatively, you can divide the input (possibly conjoined word) into two words in all possible ways and then look for words near each of them in the dictionary. It isn't as slow as it sounds -- you can use DL intermediate results of a word to get the results for its prefix.
Upvotes: 1
Reputation: 19457
I imagine the candidate conjoined word will not be longer than forty (40) characters or so; most of the time it will be less than ten (10).
Considering the small size, what about this pseudocode?
if (is_spelled_wrong(word)): N = len(word) list_suggestions = [] for i = 1 to N-1: wordA = word[0:i] // Pythonic 'slice' notation wordB = word[i+1:N] if (!is_spelled_wrong(wordA) && !is_spelled_wrong(wordB)) list_suggestions.appened((wordA, wordB))
In other words, just scan the string for all possibilities. There are a small number of them. In the case of "areyou", you would loop five (5) times.
Upvotes: 3