Reputation: 4636
I'm using OCR to read images and PDFs, and afterwards I try to extract certain numbers from the result. In some cases, the OCR algorithm reads a zero as the letter "o".
The OCR gave me this string:
Siabicbnenl| 033-7 | _o3300.81086 42000.000002 20852.301017 1 82510000030694
Prerfasa afesad
If the OCR had read it right, it would have been like this:
Siabicbnenl| 033-7 | _03300.81086 42000.000002 20852.301017 1 82510000030694
Prerfasa afesad
I want to capture this: 03300.81086 42000.000002 20852.301017 1 82510000030694
My pattern (?s)\d{5}\.?\d{5}.*?\d{5}\.?\d{6}.*?\d{5}\.?\d{6}.*?\d.*?\d{14} would have worked fine if the OCR had read it right, but here I ran into a new situation: the OCR confused zero with "o".
Is there a way to fix my pattern so that it also treats "o" as zero, or will I need a fallback along the lines of if 'didnt find anything': str.replace("o", "0") and then run it again?
Upvotes: 0
Views: 1196
Reputation: 23256
The character class \d is equivalent to [0-9] for ASCII input. If you want to include the lower-case "o" as well, you could use [0-9o] everywhere you use \d now.
If you expect that the input contains digit characters other than the ASCII 0 to 9, you can combine \d with o in a (capturing) group with two alternatives: (\d|o). If you like you can make it non-capturing, too: (?:\d|o).
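As a rough illustration of the [0-9o] approach, here is a minimal sketch. It assumes Python's re module, since the question does not say which regex engine is in use; the names DIGIT, PATTERN and extract_numbers are made up for this example. It keeps the structure of the pattern from the question, swaps every \d for [0-9o], and then turns any matched "o" back into "0".

```python
import re

# Assumption: Python's re module; the question does not name its regex engine.
DIGIT = r"[0-9o]"  # a digit, or a lower-case "o" misread by the OCR

# Same structure as the pattern in the question, with every \d swapped for DIGIT.
# re.DOTALL stands in for the inline (?s) flag.
PATTERN = re.compile(
    rf"{DIGIT}{{5}}\.?{DIGIT}{{5}}.*?"
    rf"{DIGIT}{{5}}\.?{DIGIT}{{6}}.*?"
    rf"{DIGIT}{{5}}\.?{DIGIT}{{6}}.*?"
    rf"{DIGIT}.*?{DIGIT}{{14}}",
    re.DOTALL,
)


def extract_numbers(text):
    """Hypothetical helper: return the match with "o" normalised to "0", or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Every "o" inside the match is rewritten, including any that fall in
    # the gaps covered by .*? between the number groups.
    return match.group(0).replace("o", "0")


ocr_text = (
    "Siabicbnenl| 033-7 | _o3300.81086 42000.000002 20852.301017 1 82510000030694\n"
    "Prerfasa afesad"
)
print(extract_numbers(ocr_text))
```

To accept non-ASCII digits as well, DIGIT could be set to r"(?:\d|o)" instead. Note that either variant will also accept an "o" that is a genuine letter, so the fixed group lengths are what keep ordinary words from matching.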
Upvotes: 1