Reputation: 83
I am trying to extract VAT numbers from invoices with OCR and a regex, but the letter B is often misread as the digit 8. For example, the VAT number is B28125185 and the OCR returns 828125185, so the regex does not match it. I have read about Levenshtein distance, but I don't know how to apply it here. Is there a way to solve this problem?
Thanks
Upvotes: 0
Views: 2435
Reputation: 1551
If the images you're using share a specific font, you could look into training a model for that font. Here's a video that describes the process: https://www.youtube.com/watch?v=TpD76k2HYms
Alternatively, you could fine-tune on your own data: feed Tesseract images of VAT numbers together with their correct text, so it learns what they look like.
Here's a link to the documentation for training:
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining
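If retraining is more than you need, post-processing the OCR text may be enough. Below is a minimal sketch, not a definitive implementation. It assumes the VAT number is one uppercase letter followed by eight digits (inferred only from the B28125185 example), and the confusion table, the function names and the list of known VAT numbers are all hypothetical. The first part lets the regex accept a confusable digit in the letter position and maps it back; the second part uses Levenshtein distance to snap an OCR result to the closest entry in a list of VAT numbers you already know.

```python
import re

# Assumed format (not confirmed by the question): one letter + 8 digits.
# The first position also accepts a digit so OCR misreads can still match.
TOLERANT_PATTERN = re.compile(r"\b([A-Z0-9])(\d{8})\b")

# Hypothetical confusion table: digit OCR may return -> letter it likely was.
DIGIT_TO_LETTER = {"8": "B", "0": "O", "5": "S", "1": "I", "2": "Z", "6": "G"}


def extract_vat_numbers(text):
    """Return VAT-like tokens, repairing a misread leading letter."""
    results = []
    for match in TOLERANT_PATTERN.finditer(text):
        first, digits = match.groups()
        if first.isdigit():
            if first not in DIGIT_TO_LETTER:
                continue  # no plausible letter for this digit, skip it
            first = DIGIT_TO_LETTER[first]
        results.append(first + digits)
    return results


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_known(candidate, known_numbers, max_distance=1):
    """Return the known VAT number nearest to candidate, or None."""
    best = min(known_numbers, key=lambda known: levenshtein(candidate, known))
    return best if levenshtein(candidate, best) <= max_distance else None


if __name__ == "__main__":
    print(extract_vat_numbers("VAT: 828125185"))
    print(closest_known("828125185", ["B28125185", "A12345678"]))
```

Note that the tolerant regex will also pick up any unrelated 9-digit number that starts with a digit in the table, so it works best when you search near a label such as "VAT" or validate the result afterwards. The Levenshtein lookup only helps if you have a list of expected VAT numbers to compare against.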
Upvotes: 1