I have the following European driving license info, extracted using Tesseract.js (below is not the exact text, but an outline of the format these data fields have on such documents). I would like to write several regular expressions, each matching a different data field on the driving license. The ordering numbers below correspond to the digit preceding each field on any European driving license; the rules for labelling these fields, along with some specimen photos, can be checked on Wikipedia:
For instance, for the first name(s) I wrote this regex: 2\.?\s*[\p{Letter}\s]*
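For illustration, here is a minimal sketch of how that regex could be applied in JavaScript. The capture group and the sample string are my own additions (hypothetical, not taken from a real document); note that `\p{Letter}` is only recognised when the regex carries the `u` flag.

```javascript
// \p{Letter} requires the `u` (unicode) flag in JavaScript.
// The parentheses are an added capture group so the name can be read back.
const firstNameRe = /2\.?\s*([\p{Letter}\s]*)/u;

// Hypothetical OCR output for field 2.
const sample = "2. MARIA ELENA";

const match = sample.match(firstNameRe);
console.log(match ? match[1] : null);
```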
The problem is that with such a name:
my regex might not match because of the OCR seeing the O in IOAN as the digit 0. If I add the 0 to my regex like so: 2\.?\s*[\p{Letter}0\s]* then it will match; so the solution is simple for this scenario. Nonetheless, other times, I've had Bs mistaken for 8s or As mistaken for 4s. It might become even trickier when trying to account for other scenarios (consider all possible Unicode letter characters that could be mistaken for a digit).
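To make the hardcoding idea concrete, here is a rough sketch. The `DIGIT_TO_LETTER` map is hand-picked by me and is an assumption, not a complete or standard list; the helper name `extractFirstName` is likewise hypothetical. The same map builds the character class and converts the digits back to letters after matching, so the two stay in sync.

```javascript
// Hand-picked digit -> letter confusions (an assumption, certainly incomplete).
const DIGIT_TO_LETTER = { "0": "O", "1": "I", "4": "A", "5": "S", "8": "B" };

// Build the character class from the map's keys.
const digits = Object.keys(DIGIT_TO_LETTER).join("");
const firstNameRe = new RegExp(`2\\.?\\s*([\\p{Letter}${digits}\\s]*)`, "u");

// Hypothetical helper: returns the corrected first name, or null if field 2 is absent.
function extractFirstName(ocrText) {
  const match = ocrText.match(firstNameRe);
  if (!match) return null;
  // Swap every confusable digit in the captured name for its letter.
  return match[1].replace(/\d/g, (d) => DIGIT_TO_LETTER[d] ?? d).trim();
}

console.log(extractFirstName("2. I0AN")); // hypothetical OCR output
```

This only works for fields that are known to contain letters only; for fields that legitimately mix letters and digits, the replacement step would corrupt the value.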
Is there a list that maps digits to their visually similar letter counterparts? I am looking to hardcode these in my regex for better matching. (A mapping between digits and Unicode characters in general, not just the characters of the Latin alphabet, would help even more.) Or is there a better solution that accounts for all possible scenarios?
Some may think that including 0-9 inside the square brackets of my regex would be a viable solution, but something like the below might happen, which does not satisfy my needs for the current scenario. Of course, I could do some whitespace parsing to account for that, or even exclude the digit 3, which I know for certain will always follow on the next line, by doing something like 2\.?\s*[\p{Letter}0-24-9\s]*, but I still think better solutions could be found.
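As a sketch of the whitespace-parsing alternative: if the name is assumed to sit on the same line as the "2." label (an assumption about the OCR layout), the whitespace can be restricted to non-line-break characters so that the match stops at the end of the line and never reaches the 3 below. The function name and the sample text are hypothetical.

```javascript
// [^\S\r\n] matches any whitespace except line breaks,
// so the match cannot spill over onto the next line.
const nameLineRe = /2\.?[^\S\r\n]*([\p{Letter}0-9]+(?:[^\S\r\n]+[\p{Letter}0-9]+)*)/u;

// Hypothetical helper: returns the raw field-2 text from a single line, or null.
function extractNameLine(ocrText) {
  const match = ocrText.match(nameLineRe);
  return match ? match[1] : null;
}

// Hypothetical OCR output; field 3 follows on the next line.
console.log(extractNameLine("2. I0AN MIHAI\n3. 01.01.1990"));
```

With this shape, all of 0-9 can stay inside the brackets, and any misread digits can be converted to letters in a later step.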