aabujamra

Reputation: 4636

OCR confusing zero with "o" - how to specify zero or letter "o" in python regex?

I'm using OCR to read images and PDFs, and afterwards I try to extract certain numbers from the output. In some cases, the OCR algorithm reads a zero as the letter "o".

The OCR gave me this string:

Siabicbnenl| 033-7 | _o3300.81086 42000.000002 20852.301017 1 82510000030694



Prerfasa afesad

If the OCR had read it correctly, it would have been like this:

Siabicbnenl| 033-7 | _03300.81086 42000.000002 20852.301017 1 82510000030694



Prerfasa afesad

I want to capture this part: 03300.81086 42000.000002 20852.301017 1 82510000030694

My pattern (?s)\d{5}\.?\d{5}.*?\d{5}\.?\d{6}.*?\d{5}\.?\d{6}.*?\d.*?\d{14} would have worked fine if the OCR had read it correctly, but here I ran into a new situation:

OCR confused zero with "o"

Is there a way to fix my pattern so that it also treats "o" as zero, or will I need to do something like if 'didn't find anything': str.replace("o", "0") and run it again?

Upvotes: 0

Views: 1196

Answers (1)

mkrieger1

Reputation: 23256

The character class \d is equivalent to [0-9] for ASCII input. If you want to include the lower-case "o" as well, you could use [0-9o] everywhere you use \d now.
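As a rough sketch of that substitution, the snippet below rebuilds the pattern from the question with [0-9o] in place of every \d and runs it on the OCR string from the question. The names D, ocr_text, raw and fixed are made up for illustration, and the final replace("o", "0") on the matched text is an extra step that assumes you want the cleaned digits back.

```python
import re

# OCR output from the question, with a letter "o" where a zero belongs.
ocr_text = "Siabicbnenl| 033-7 | _o3300.81086 42000.000002 20852.301017 1 82510000030694"

# Hypothetical helper name: the character class used instead of \d.
D = "[0-9o]"

# Same structure as the pattern in the question, with \d swapped for D.
pattern = re.compile(
    rf"(?s){D}{{5}}\.?{D}{{5}}.*?{D}{{5}}\.?{D}{{6}}.*?{D}{{5}}\.?{D}{{6}}.*?{D}.*?{D}{{14}}"
)

match = pattern.search(ocr_text)
raw = match.group(0) if match else None

# Normalise only the matched text, so an "o" elsewhere in the page is left alone.
fixed = raw.replace("o", "0") if raw is not None else None

print(raw)
print(fixed)
```

Replacing "o" only inside the match avoids touching ordinary words in the rest of the OCR output, which a blanket replace on the whole string would do.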

If you expect that the input contains digit characters other than the ASCII 0 to 9, you can combine \d with o in a (capturing) group with two alternatives: (\d|o). If you like you can make it non-capturing, too: (?:\d|o).
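A small check of the group form might look like the following. It assumes Python 3 str patterns, where \d also matches non-ASCII decimal digits; the sample strings and variable names are invented for the example, and [\do] is shown only as another way to write the same alternative, not something taken from the answer.

```python
import re

ascii_sample = "o3300"                              # OCR-style input with "o" for zero
unicode_sample = "\u0660\u0663\u0663\u0660\u0660"   # five Arabic-Indic digits

group_form = re.compile(r"(?:\d|o){5}")   # non-capturing group with two alternatives
class_form = re.compile(r"[\do]{5}")      # \d used inside a character class
ascii_only = re.compile(r"[0-9o]{5}")     # ASCII digits plus "o" only

group_ok = bool(group_form.fullmatch(ascii_sample))
class_ok = bool(class_form.fullmatch(ascii_sample))
unicode_ok = bool(group_form.fullmatch(unicode_sample))
ascii_only_rejects = ascii_only.fullmatch(unicode_sample) is None

print(group_ok, class_ok, unicode_ok, ascii_only_rejects)
```

If the OCR engine only ever emits ASCII digits, the simpler [0-9o] class is probably enough.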

Upvotes: 1
