Reputation: 373
I am working on this spellchecker and one of the methods I use to suggest corrections to the user is inserting multiple characters into the word. This allows for words like exmpl
to be corrected to example
Here's the actual code:
public static Set<String> multiInsert(String word, int startIndex) {
Set<String> result = new HashSet<>();
//List of characters to insert
String correction = "azertyuiopqsdfghjklmwxcvbnùûüéèçàêë";
for (int i = startIndex; i <= word.length(); i++) {
for (int j = i + 1; j <= word.length(); j++) {
for (int k = 0; k < correction.length(); k++) {
String newWord = word.substring(0, j) + correction.charAt(k) + word.substring(j);
result.addAll(multiInsert(newWord, startIndex + 2));
if(dico.contains(newWord)) result.add(newWord);
}
}
}
return result;
}
The problem with this function is that it takes A LOT of time to process especially when the word is long or when I have too many words to correct. Is there any better way to implement this function or optimize it?
Upvotes: 0
Views: 430
Reputation: 40659
What's making it slow is you're testing for strings that are not in the dictionary. There are lots more possible misspellings than there are words in the dictionary. You need to be guided by the dictionary.
This is the general spelling correction problem. I've programmed it several times.
In a nutshell, the method is to store the dictionary as a trie, and do bounded depth-first walk of the trie. At each step, you keep track of the distance between the word in the trie and the original word. Whenever that distance exceeds the bound, you prune the search.
So you do it in cycles, increasing the bound each time. First you do it with a bound of 0, so it will only find an exact match. That is equivalent to ordinary trie search. If that did not yield a match, do the walk again with a bound of 1. That will find all dictionary words that are distance 1 from the original word. If that didn't yield any, increase the bound to 2, and so on. (What constitutes an increment of distance is any transformation you choose, like insertion, deletion, replacement, or more general re-writings.)
The performance is bounded by the true distance times the dictionary size. Short of that, it is exponential in the true distance. Since each walk costs a factor times the previous walk, the time is dominated by the final walk, so the prior walks do not add much time.
There is an advantage to organizing the dictionary as a trie, since a trie is just a particular form of Finite State Machine. You can add sub-machines to it to handle common prefixes and suffixes, without massively expanding the dictionary. Consider these words: nation, national, nationalism, nationalistic, nationalistical, nationalisticalism, ... Such words may not be common, but they are not impossible. A suffix trie handles them easily. Similarly prefixes like pre-, post-, un-, de-, in-, etc.
Upvotes: 2
Reputation: 965
You can have a look at Jazzy which is a Java spellchecker API.
You might also want to consider Fuzzy String Matching.
Upvotes: 0