Altair7852
Altair7852

Reputation: 1418

OCR specific approximate string matching library

I have a text extracted from image using OCR. Some of the words are not correctly recognized in the text as follows:

'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'

As you can see optically some characters is easy to mix for others: 1 -> I, O -> D -> Q, H -> W, U -> N and so on.

Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing OCR specific algorithm that can help compare words to a predefined dictionary and give a score, taking into account possible OCR character mixups?

Upvotes: 1

Views: 1292

Answers (1)

interfect
interfect

Reputation: 2877

I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.

Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.

Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.

If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.

Upvotes: 1

Related Questions