Reputation: 7212
I'm sure more than a few of you will have seen the Google Wave demonstration. I was wondering about the spell checking technology specifically. How revolutionary is a spell checker that makes its suggestions by figuring out where a word appears contextually within a sentence?
I haven't seen this technique before. Are there examples of it elsewhere, and if so, are there code examples and literature on how it works?
Upvotes: 2
Views: 1568
Reputation: 11907
There are a lot of papers on this subject. Here are some good resources.
This one doesn't use context sensitivity, but it's a good base to build from (a condensed sketch follows this list): http://norvig.com/spell-correct.html
This is probably a good and easy-to-understand view of a more powerful spell checker: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf
From there you can dive deep into the particulars. I'd recommend using Google Scholar, looking up the references in the paper above, and searching for 'spelling correction'.
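For reference, here is a condensed sketch in the spirit of the Norvig article: rank single-edit candidates of a misspelled word by how often they appear in a corpus. The corpus file name "big.txt" is just a placeholder for any large text you have on hand.

    import re
    from collections import Counter

    def words(text):
        return re.findall(r"[a-z]+", text.lower())

    # Word frequencies from a large plain-text corpus (placeholder file name).
    WORD_COUNTS = Counter(words(open("big.txt").read()))

    def edits1(word):
        """All strings one delete/transpose/replace/insert away from `word`."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correction(word):
        """Pick the most frequent known candidate; fall back to the word itself."""
        candidates = ({word} & WORD_COUNTS.keys()) or \
                     (edits1(word) & WORD_COUNTS.keys()) or {word}
        return max(candidates, key=WORD_COUNTS.get)

    print(correction("speling"))  # -> "spelling", if the corpus contains it

Note that this corrects each word in isolation, which is exactly the limitation the contextual approach in the question is meant to address.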
Upvotes: 1
Reputation:
You can learn all about topics like this by diving into natural language processing. You can even go as in-depth as making a statistical guess as to which word will come next after a string of given words.
If you are interested in this topic, I highly suggest the NLTK (Natural Language Toolkit), written entirely in Python. It is a very expansive project, with many tools and pretty good documentation.
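As a taste of that "guess the next word" idea, here is a minimal sketch using NLTK, assuming you have downloaded the Brown corpus (nltk.download('brown')); any other corpus would work the same way.

    import nltk
    from nltk.corpus import brown

    # Count how often each word follows each other word in the corpus.
    bigrams = nltk.bigrams(w.lower() for w in brown.words())
    cfd = nltk.ConditionalFreqDist(bigrams)

    # The most likely words to follow "bean", by frequency in the Brown corpus.
    print(cfd["bean"].most_common(5))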
Upvotes: 1
Reputation:
You should also watch an official video by Casey Whitelaw of the Google Wave team that describes the techniques used: http://www.youtube.com/watch?v=Sx3Fpw0XCXk
Upvotes: 1
Reputation: 636
My 2 cents: given that translate.google.com is a statistical machine translation engine, and given "The Unreasonable Effectiveness of Data" by A. Halevy, P. Norvig (Director of Research at Google) & F. Pereira, my assumption (bet) is that this is a statistically driven spell checker.
How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase-tables in adapted data structures (suffix arrays, for example, if you have to count the n-gram subsets) that keep track of the counts (and so the estimated probabilities) of the n-grams. A toy sketch follows the example below.
For example, if your corpus consists only of:
I had bean soup last dinner.
From this entry, you will generate the following bi-grams (sets of 2 words):
I had, had bean, bean soup, soup last, last dinner
and the tri-grams (sets of 3 words):
I had bean, had bean soup, bean soup last, soup last dinner
Some of these will then be pruned by tests of statistical relevance; for example, we can assume that the tri-gram
I had bean
will disappear from the phrase-table.
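Purely to make the counting step concrete, here is a toy sketch in Python (not how Google would actually store phrase-tables; a real system would use suffix arrays or similar compressed structures and prune rare n-grams).

    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-word sequences in a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Tiny stand-in corpus: one tokenized sentence.
    corpus = ["I had bean soup last dinner".lower().split()]

    phrase_table = Counter()
    for sentence in corpus:
        phrase_table.update(ngrams(sentence, 2))  # bi-grams
        phrase_table.update(ngrams(sentence, 3))  # tri-grams

    total = sum(phrase_table.values())
    print(phrase_table[("had", "bean", "soup")] / total)  # estimated "probability"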
Now, spell checking only has to look in these big phrase-tables and check the "probabilities". (You need good infrastructure to store these phrase-tables in an efficient data structure and in RAM. Google has it for translate.google.com, so why not for this? It's easier than statistical machine translation.)
Ex: you type
I had been soup
and in the phrase-table there is a
had bean soup
tri-gram with a much higher probability than what you just typed! Indeed, you only need to change one word (it is a "not so distant" tri-gram) to get a tri-gram with a much higher probability. There would have to be an evaluation function that handles the trade-off between distance and probability. This distance could even be computed in terms of characters: we are doing spell checking, not machine translation.
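To make that trade-off concrete, here is a hypothetical scoring function; the weighting factor alpha and the character-level similarity measure are my own inventions, only there to illustrate the idea of balancing probability against distance.

    from difflib import SequenceMatcher

    def char_similarity(a, b):
        """Character-level similarity between two phrases, in [0, 1]."""
        return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()

    def suggest(typed, phrase_table, alpha=0.5):
        """Return the stored tri-gram maximizing a mix of probability and closeness."""
        total = sum(phrase_table.values())
        best, best_score = typed, 0.0
        for candidate, count in phrase_table.items():
            if len(candidate) != len(typed):
                continue
            score = alpha * (count / total) + (1 - alpha) * char_similarity(typed, candidate)
            if score > best_score:
                best, best_score = candidate, score
        return best

    phrase_table = {("had", "bean", "soup"): 12, ("had", "been", "so"): 1}
    print(suggest(("had", "been", "soup"), phrase_table))  # -> ('had', 'bean', 'soup')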
This is only my hypothetical opinion. ;)
Upvotes: 11