Reputation: 71
The problem:
There is a set of word S = {W1,W2.. Wn} where n < 10. This set just exists, we do not know its content. These words are drawn on some image and then recognized. The OCR algorytm is poor as well as dpi and as a result there are mistakes. So we have a second set of errorneous words S' = {W1',W2'..Wn'}
Now we have a word W that is a member of original set S. And now I need and algorythm which, given W and S', return index of the word in S'. most similar to W.
Example. S is {"alpha", "bravo", "charlie"}, S' is for example {"alPha","hravc","onarlio"} (these are real possible ocr erros).
So the target function should return F("alpha") => 0, F("bravo") => 1, F("charlie") => 2
I tried Levenshtein distance, but it does not work well, because it returns small numbers on small strings and OCRed string can be longer than original. Example if W' is {'hornist','cornrnunist'} and the given word is 'communist' the Levenshtein distance is 4 for the both words, but the right one is second.
Any suggestions?
Upvotes: 2
Views: 728
Reputation: 6040
As a zero approach, I'd suggest you to use the modification of Levenshtein distance algorithm with conditional cost of replacing/deleting/adding characters:
Distance(i, j) = min(Distance(i-1, j-1) + replace_cost(a.charAt(i), b.charAt(j)),
Distance(i-1, j ) + insert_cost(b.charAt(j)),
Distance(i , j-1) + delete_cost(a.charAt(i)))
You can implement function replace_cost
in such way, that it will returns small values for visually similar characters (and high values for visually different characters), e.g.:
// visually similar characters
replace_cost('o', '0') = 0.1
replace_cost('o', 'O') = 0.1
replace_cost('O', '0') = 0.1
...
// visually different characters
replace_cost('O', 'K') = 0.9
...
And the similar approach can be used for insert_cost
and delete_cost
(e.g. you may notice, that during the OCR - some characters are more likely to disappear than others).
Also, in case when approach from above is not enough for you, I'd suggest you to look at Noisy channel model - which is widely used for spelling correction (this subject described very well in Natural Language Processing course by Dan Jurafsky, Christopher Manning - "Week 2 - Spelling Correction").
Upvotes: 4
Reputation: 2748
This appears to be quite difficult to do because the misread strings are not necessarily textually similar to the input, which is why Levinshtein distance won't work for you. The words are visually corrupted, not simply mistyped. You could try creating a dataset of common errors (o => 0, l -> 1, e => o) and then do some sort of comparison based on that.
If you have access to the OCR algorithm, you could run that algorithm again on a much broader set of inputs (with known outputs) and train a neural network to recognize common errors. Then you could use that model to predict mistakes in your original dataset (maybe overkill for an array of only ten items).
Upvotes: 1