Reputation: 238
I have a trained OCR model that reads specific fonts. In some of these fonts, certain characters look identical, such as 1 and capital I, so when wordlist prediction fails I occasionally get an I where a 1 should be and a 1 where an I should be.
In my case, I know that there should never be...
1NDEPENDENCE DAY
45I OZ
I% OFF
TEMP: -I DEGREES
TIME: I TO 5 PM
II A.M.
This is my first attempt, which addresses some of these cases, but I'm sure there's a more efficient way to do this. Maybe looping over a list of regex patterns with re.sub?
import re

ocr_output = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."

# An I preceded by a digit, + or - becomes 1.
while True:
    x = re.search(r"[\d+-]I", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start() + 1] + '1' + ocr_output[x.start() + 2:]
    else:
        break

# An I followed by a digit, % or - becomes 1.
while True:
    x = re.search(r"I[\d%-]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + '1' + ocr_output[x.start() + 1:]
    else:
        break

# A 1 preceded by a capital letter becomes I.
while True:
    x = re.search(r"[A-Z]1", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start() + 1] + 'I' + ocr_output[x.start() + 2:]
    else:
        break

# A 1 followed by a capital letter becomes I.
while True:
    x = re.search(r"1[A-Z]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + 'I' + ocr_output[x.start() + 1:]
    else:
        break

print(ocr_output)
>>> TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, II A.M.
What more elegant solutions can you think of to correct my OCR output for these cases? I'm working in Python. Thank you!
Upvotes: 1
Views: 310
Reputation: 1273
I would use this kind of regex for the general cases. Don't forget to precompile the regexes for performance reasons.
import re

# An I next to a %, digit, sign or another I is treated as a 1.
I_regex = re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])")
# A 1 next to a letter is treated as an I.
One_regex = re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])")

def preprocess(text):
    output = I_regex.sub('1', text)
    output = One_regex.sub('I', output)
    return output
Output:
>>> preprocess('TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M.')
'TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.'
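The question also asked about looping over a list of regexes with re.sub. Here is a minimal sketch of that idea, reusing the two patterns above unchanged. The names RULES and apply_rules are made up for illustration, and the rule order (I to 1 first, then 1 to I) is assumed to matter in the same way it does in preprocess:

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
RULES = [
    # An I next to a %, digit, sign or another I is assumed to be a 1.
    (re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])"), "1"),
    # A 1 next to a letter is assumed to be an I.
    (re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])"), "I"),
]

def apply_rules(text, rules=RULES):
    """Run every (pattern, replacement) pair over the text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("I% OFF, TEMP: -I DEGREES, 1NDEPENDENCE DAY"))
# prints: 1% OFF, TEMP: -1 DEGREES, INDEPENDENCE DAY
```

Adding a new correction then means appending one more pair to the list instead of writing another loop.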
Upvotes: 1
Reputation: 83
This is what I came up with:
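import re

def preprocess_ocr_output(text: str) -> str:
    output = text
    # A 1 followed by a word character (so not by whitespace or %) becomes I.
    output = re.sub(r"1(?![\s%])(?=\w+)", "I", output)
    # A 1 preceded by a word character becomes I.
    output = re.sub(r"(?<=\w)(?<![\s+\-])1", "I", output)
    # An I followed by a digit or % becomes 1.
    output = re.sub(r"I(?!\s)(?=[\d%])", "1", output)
    # An I preceded by +, - or a digit becomes 1.
    output = re.sub(r"(?<=[+\-\d])(?<!\s)I", "1", output)
    return output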
Regarding "a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM":
I don't think you can fix that floating "I" -> "1" case without causing problems elsewhere...
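If the words around it are predictable, one option might be to convert a solitary I only when the context is clearly numeric. This is a rough sketch, assuming ranges always look like "<x> TO <y>" with a digit on the other side. The patterns and the name fix_solitary_i are my own guesses at what the data looks like, not something tested on real OCR output:

```python
import re

# Assumption: a standalone I directly before "TO <digit>" is really a 1.
I_BEFORE_RANGE = re.compile(r"\bI\b(?=\s+TO\s+\d)")
# Assumption: a standalone I directly after "<digit> TO" is really a 1.
I_AFTER_RANGE = re.compile(r"(\d\s+TO\s+)I\b")

def fix_solitary_i(text: str) -> str:
    text = I_BEFORE_RANGE.sub("1", text)
    # \g<1> puts back the "<digit> TO " prefix captured by the group.
    text = I_AFTER_RANGE.sub(r"\g<1>1", text)
    return text

print(fix_solitary_i("TIME: I TO 5 PM"))  # prints: TIME: 1 TO 5 PM
print(fix_solitary_i("I WANT TO GO"))     # prints: I WANT TO GO
```

A range with a misread I on both sides ("I TO I") would still slip through, since neither side has a digit to anchor on.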
Upvotes: 2