Chris Nielsen
Chris Nielsen

Reputation: 14738

Spelling Suggestions for Conjoined Words

I am working on implementing a spell check function for a web-based WYSIWYG editor. I am currently using the Damerau-Levenshtein distance algorithm to produce a list of spelling suggestions. This is all working out nicely, but I am curious as to how I might improve the functionality.

Specifically, my implementation does not currently handle conjoined words. For instance, I would like to be able to detect "areyou" and suggest "are you" instead. I think I can do this by breaking the potentially conjoined word apart at likely looking segments and testing both halves. Since all English words must have at least one vowel, I think I can look for vowels to help me decide where to break words apart.

The Damerau-Levenshtein distance algorithm was so useful; it is clear that others have put a lot more thought into this than I have. Is there a similarly clever algorithm that I should consider for detecting conjoined words, or am I on the right track already?

Upvotes: 4

Views: 911

Answers (3)

Nick Johnson
Nick Johnson

Reputation: 101139

Check out this excellent article on writing a spelling checker. Using that technique, you have two options: Either include every pair of words, or every likely pair of words in the dictionary (with the separated words as the solution), or try every possible split point and do a standard dictionary lookup to see if both words are valid.

Upvotes: 1

Robert Obryk
Robert Obryk

Reputation: 2539

As you are already reading the whole dictionary for every word, it wouldn't be terribly inefficient to append common pairs of words to the dictionary. Alternatively, you can divide the input (possibly conjoined word) into two words in all possible ways and then look for words near each of them in the dictionary. It isn't as slow as it sounds -- you can use DL intermediate results of a word to get the results for its prefix.

Upvotes: 1

Heath Hunnicutt
Heath Hunnicutt

Reputation: 19457

I imagine the candidate conjoined word will not be longer than forty (40) characters or so; most of the time it will be less than ten (10).

Considering the small size, what about this pseudocode?

if (is_spelled_wrong(word)):
    N = len(word)
    list_suggestions = []
    for i = 1 to N-1:
        wordA = word[0:i] // Pythonic 'slice' notation
        wordB = word[i+1:N]
        if (!is_spelled_wrong(wordA) && !is_spelled_wrong(wordB))
            list_suggestions.appened((wordA, wordB))

In other words, just scan the string for all possibilities. There are a small number of them. In the case of "areyou", you would loop five (5) times.

Upvotes: 3

Related Questions