jfisk
jfisk

Reputation: 6215

creating a list of possible correct spellings from an incorrectly spelled word

Im doing a pretty cool homework assignment where by using a Dictionary D with text T, im supposed to scan the text T and for every word in T not in D, generate a list of possible correct spelling by doing at least one of the following common misspellings: swapping two adjacent characters, inserting an extra character, deleting a single character, and replacing a character for another.

Im unsure how to go about the last part, but here is what i have so far:

1.) use any one of the java methods to separate each word into an entry in a string array I. 2.) use a for loop with index k to go to each entry in I and use the get(k) to see if that word exists in our dictionary. if it doesn't, add that word to another string array MisspelledWords[].

3.) how could I efficiently do one of those common misspelling checks? Right now i can only think of things that would be highly inefficient, like arbitrarily changing the last letter or something.

Thanks!

Upvotes: 4

Views: 2061

Answers (5)

gcooney
gcooney

Reputation: 1699

Write an algorithm that compares two words and uses the rules your teacher gave you to determine if one word can be a misspelling for the other. If you get stuck look at Damerau-Levenshtein distance which is a modified Levenshtein distance that takes transposed characters into account.

You'll then need to use that algorithm to compare each misspelled word to every word in your dictionary. One hint to make the algorithm more efficient: You can eliminate a lot of words as candidates without calculating a distance function(Think about the minimum distance of an n character word from an n+2 character word).

Upvotes: 0

sethu
sethu

Reputation: 1721

The Answer I could think of for doing this effectively is to use sort. Basically this is the approach taken to find out anagrams.

  • You take your dictionary , sort every word and have it in a hash. eg. Dog, God - > after sorting - > DGO. So in DGO u will have Dog and god (liked chained hash).

  • Now you sort your misspelled word and find where there is a match. And you compare character by character with all other words that fall in the same bucket.

Caveat :

If there are letter missing , this cannot detect. Maybe while comparing the misspelled word , we can just consider any bucket that has all the letters . (eg) if you are searching for good , and you have god (lets hope there is no word god) . You will sort and have dgo. You can look at any bucket that has more than half of the letters .In this case 2 letters.

Once you create that hash (One time cost) , your efficiency will be good. Please let me know if there can be any improvements made on this .

Upvotes: 0

joe_coolish
joe_coolish

Reputation: 7259

I had a school project that was very similar to this one.

The basic theory is that you want to calculate the Levenshtein Distance between the word T and all of the words in dictionary D. Then you present the top X results, where the lower the distance the better.

I do agree that this project was one of my favorites. One of the interesting features that I found was that there was a particular symmetry in the resulting table, which allows for easy multi-threading of the algorithm.

Good luck!

Upvotes: 1

Marc B
Marc B

Reputation: 360772

Basically you want to calculate the Levenshtein Distance between your 'bad' word and every word in the dictionary. It's not a "cheap" process, computationally, but it'll let you detect the simple transpositions/one-char differences easily.

In short, the L.D. is the number of "steps" required to transform one string into another by adding/removing/changing only a single character in each step.

color / colour = LD of 1
mad / min = LD of 2 ( mad -> man -> min

Upvotes: 1

Rom1
Rom1

Reputation: 3207

Well a couple of pointers to get you started. If you want to store and retrieve words with common prefixes efficiently, try a prefix tree. For the spell checking part, read up on edit distance.

Also, for a simple but practical and very well explained (and short!) implementation, see this article by Norvig.

Upvotes: 1

Related Questions