Reputation: 1247
I use Tesseract OCR to extract text from various documents, then run a regex over the extracted text to check whether it matches a specific pattern. Unfortunately, OCR often confuses visually similar characters, such as 5/S, 1/I, 0/O, 2/Z, 4/A, 8/B, etc. These mistakes are so common that substituting the ambiguous characters would make the text match the pattern perfectly.
Is there a way to post-process the OCR output and substitute ambiguous characters (provided in advance) so that the result follows a specific pattern?
Expected output (and what I could come up with so far):
# example: I am extracting car plate numbers that always follow the pattern [A-Z]{2}\d{5}
# patterns might differ for other examples, but will always be some alphanumeric combination
# complex patterns may be ignored with some warning like "unable to parse"
import re

def post_process(pattern, text, ambiguous_dict):
    # get text[0], check pattern
    # in this case, it should be a letter; if not, try to replace from dict; if yes, pass
    # continue with the next characters until a match is found or the whole text has been looped over
    if match:
        return match
    else:
        # some error message
        return None

ambiguous_dict = {'2': 'Z', 'B': '8'}
# My plate photo text: AZ45287
# Noise is fairly easy to filter out by filtering on tesseract confidence level, although not ideal
# so, if a function cannot be made that would find a match through the noise
# the noise can be ignored in favor of a simpler function that can just find a match
ocr_output = "someNoise A2452B7 no1Ze"
# 2 in position 1 is replaced by Z, B is replaced by 8. It would be acceptable if the function will
# while '2' on pos 5 should remain a 2 as per pattern
# do this iteratively for each element of ocr_output until pattern is matched or return None
# Any other functionally similar (recursive, generator, other) approach is also acceptable.
result = post_process(r"[A-Z]{2}\d{5}", ocr_output, ambiguous_dict)
if result:
    print(result) # AZ45287
else: # result is None
    print("failed to clean output")
I hope I explained my problem well, but feel free to request additional info.
Upvotes: 1
Views: 848
Reputation: 627292
As always with OCR, it is hard to come up with a 100% safe and working solution. In this case, what you can do is add the "corrupt" chars to the regex and then "normalize" the matches using dictionaries of replacements.
It means that you just can't use [A-Z]{2}\d{5} because among the first two uppercase letters there can be a 2, and among the five digits there can be a B. Thus, you need to change the pattern to ([A-Z2]{2})([\dB]{5}) here. Note the capturing parentheses that create two subgroups. To normalize each, you need two separate replacements, as it appears you do not want to replace digits with letters in the numeric part (\d{5}) or letters with digits in the letter part ([A-Z]{2}).
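To make the difference concrete, here is a small check (just an illustration using the sample string from the question) showing that the strict pattern finds nothing, while the widened pattern captures the two parts separately:

```python
import re

ocr_output = "someNoise A2452B7 no1Ze"

# The strict pattern finds nothing because of the misread characters.
strict = re.search(r"[A-Z]{2}\d{5}", ocr_output)
print(strict)  # None

# The widened pattern tolerates a '2' among the letters and a 'B' among the digits.
loose = re.search(r"([A-Z2]{2})([\dB]{5})", ocr_output)
print(loose.group(1), loose.group(2))  # A2 452B7
```

The two captured groups are still "raw" at this point; they only become AZ and 45287 after each one is normalized with its own replacement table.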
So, here is how it can be implemented in Python:
import re

def post_process(pattern, text, ambiguous_dict_1, ambiguous_dict_2):
    matches = list(re.finditer(pattern, text))
    if matches:
        # Normalize each capturing group with its own translation table
        return [f"{x.group(1).translate(ambiguous_dict_1)}{x.group(2).translate(ambiguous_dict_2)}" for x in matches]
    else:
        return None

ambiguous_dict_1 = {ord('2'): 'Z'} # For the first group
ambiguous_dict_2 = {ord('B'): '8'} # For the second group
ocr_output = "someNoise A2452B7 no1Ze"
result = post_process(r"([A-Z2]{2})([\dB]{5})", ocr_output, ambiguous_dict_1, ambiguous_dict_2)
if result:
    print(result) # a list with one normalized plate
else: # result is None
    print("failed to clean output")
# => ['AZ45287']
The ambiguous_dict_1 dictionary contains the digit to letter replacements and ambiguous_dict_2 contains the letter to digit replacements.
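If you would rather pass the original pattern and let the code widen it, something along these lines might work. This is only a rough sketch under strong assumptions: it understands patterns made solely of [A-Z]{n} and \d{n} pieces and raises "unable to parse" for anything else, and it takes a single digit-to-letter confusion table that is inverted for the numeric parts. The names TOKEN, build_loose_pattern and digit_to_letter are made up for this illustration.

```python
import re

# Assumption: only sequences of "[A-Z]" and "\d" tokens, each with an optional {n}.
TOKEN = re.compile(r"(\[A-Z\]|\\d)(?:\{(\d+)\})?")

def build_loose_pattern(pattern, digit_to_letter):
    """Widen each token with its look-alike characters; one capture group per token."""
    letter_to_digit = {letter: digit for digit, letter in digit_to_letter.items()}
    fragments, tables = [], []
    pos = 0
    while pos < len(pattern):
        tok = TOKEN.match(pattern, pos)
        if tok is None:
            raise ValueError(f"unable to parse pattern: {pattern!r}")
        kind, count = tok.group(1), tok.group(2) or "1"
        if kind == "[A-Z]":
            # Letter slot: also accept digits that OCR confuses with letters.
            extra = re.escape("".join(digit_to_letter))
            fragments.append(f"([A-Z{extra}]{{{count}}})")
            tables.append(str.maketrans(digit_to_letter))
        else:
            # Digit slot: also accept letters that OCR confuses with digits.
            extra = re.escape("".join(letter_to_digit))
            fragments.append(f"([\\d{extra}]{{{count}}})")
            tables.append(str.maketrans(letter_to_digit))
        pos = tok.end()
    return re.compile("".join(fragments)), tables

def post_process(pattern, text, digit_to_letter):
    regex, tables = build_loose_pattern(pattern, digit_to_letter)
    # Translate every captured group with the table that belongs to its slot.
    results = [
        "".join(group.translate(table) for group, table in zip(m.groups(), tables))
        for m in regex.finditer(text)
    ]
    return results or None

print(post_process(r"[A-Z]{2}\d{5}", "someNoise A2452B7 no1Ze", {"2": "Z", "8": "B"}))
# ['AZ45287']
```

Keep in mind that the more look-alike pairs you add to the table, the more likely the widened pattern is to match inside the noise, so filtering by Tesseract confidence first is probably still worthwhile.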
Upvotes: 1