Is there a list that makes the correlation between digits and visually-similar letters? Or vice-versa? (Unicode / Latin, OCR context)

Question

I have extracted the text of a European driving license using Tesseract.js, and I would like to write several regular expressions that each match a different data field on the license. The list below is not the exact text, but rather an outline of the format these data fields have on such documents. The ordering numbers below correspond to the number preceding the field on any European driving license; the rules for labelling the data fields of these documents can be checked on Wikipedia, which also has some specimen photos.

1. surname (last name)
2. other names (first name(s))
3. date of birth
4b. date of expiry
5. ID drivingLicense
8. address

For instance, I wrote the regex 2\.?\s*[\p{Letter}\s]* for the first name(s). The problem appears with a name such as:

IOAN

my regex might not match the whole name, because the OCR can read the O in IOAN as the digit 0. If I add 0 to the character class, as in 2\.?\s*[\p{Letter}0\s]*, it matches, so the fix is simple for this case. At other times, however, I have had Bs mistaken for 8s and As mistaken for 4s. It gets trickier when trying to cover every case (consider all the Unicode letters that could be mistaken for a digit).
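A minimal sketch of the hardcoding idea described above, assuming a small hand-picked table of look-alikes. The table DIGIT_TO_LETTER and the helpers restoreLetters and extractOtherNames are hypothetical names made up for illustration, and the five pairs are only the confusions mentioned here plus two common ones, not an authoritative list. The sketch also assumes that a name field never contains a genuine digit.

```javascript
// Hypothetical, hand-picked digit -> letter look-alikes; not an authoritative list.
const DIGIT_TO_LETTER = { "0": "O", "1": "I", "4": "A", "5": "S", "8": "B" };

// Character-class fragment built from the table keys.
const confusableDigits = Object.keys(DIGIT_TO_LETTER).join("");

// Field 2 (other names): letters, look-alike digits, spaces and tabs only,
// so the match stays on one line. The "u" flag is required for \p{Letter}.
const otherNames = new RegExp(
  `2\\.?[ \\t]*([\\p{Letter}${confusableDigits} \\t]*)`,
  "u"
);

// Replace each look-alike digit in a captured name with its letter.
function restoreLetters(text) {
  return text.replace(/[0-9]/g, (d) => DIGIT_TO_LETTER[d] ?? d);
}

// Return the cleaned-up name from field 2, or null if the field is not found.
function extractOtherNames(ocrText) {
  const match = otherNames.exec(ocrText);
  return match ? restoreLetters(match[1].trim()) : null;
}
```

Keeping the look-alikes in one table means the character class and the clean-up step cannot drift apart: adding a pair to the table extends both.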

Is there a list that maps digits to their visually similar letter counterparts? I would like to hardcode these in my regex for better matching. A mapping between digits and Unicode characters in general, not just the characters of the Latin alphabet, would help even more. Or is there a better way to account for all possible scenarios?

Some may think that including 0-9 inside the square brackets of my regex would be a viable solution. But then the match can run on past the name, because \s also matches the line break and the next line starts with the digit 3, which does not satisfy my needs for the current scenario. Of course, I could do some whitespace parsing to account for that, or even exclude the digit 3 that I know for certain will always follow on the next line, with something like 2\.?\s*[\p{Letter}0-24-9\s]*, but I still think better solutions could be found.
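A small sketch of what goes wrong when every digit is allowed, assuming the OCR text has one field per line. The sample text and its values are made up, and the second pattern (spaces and tabs instead of \s) is only one possible variant of the whitespace handling, not a confirmed fix.

```javascript
// Made-up OCR output in the assumed one-field-per-line layout.
const ocrText = "1. POPESCU\n2. I0AN\n3. 01.01.1990";

// Allowing every digit plus \s lets the match cross the line break
// and swallow the "3" that starts the next field.
const allDigits = /2\.?\s*[\p{Letter}0-9\s]*/u;

// Variant: only spaces and tabs count as whitespace, so the match
// ends at the line break even though every digit is allowed.
const lineBound = /2\.?[ \t]*[\p{Letter}0-9 \t]*/u;

const allDigitsMatch = ocrText.match(allDigits)[0];
const lineBoundMatch = ocrText.match(lineBound)[0];

console.log(JSON.stringify(allDigitsMatch)); // "2. I0AN\n3"
console.log(JSON.stringify(lineBoundMatch)); // "2. I0AN"
```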
