How can I fix letter confusion in python-tesseract OCR?

I am trying to extract VAT invoice numbers with OCR and regex, but the letter B is often confused with the digit 8. For example, the VAT number is B28125185 and the OCR returns 828125185, so the regex does not detect it. I have read something about Levenshtein distance, but I don't know how to implement it. Is there a way to solve this problem?

thanks

Upvotes: 0

Views: 2435

Answers (1)

K41F4r

Reputation: 1551

If the images you're using have a specific font, you could look into training a model for that font. Here's a video that describes the process: https://www.youtube.com/watch?v=TpD76k2HYms

Alternatively, you could try training on images: feed Tesseract images of VAT numbers together with their text, to teach it what they look like.

Here's a link to the documentation for training:

https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining
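
If retraining is more than you need, a lighter option is to repair the OCR output before running the regex. The sketch below is not Levenshtein distance; it is a simpler position-aware substitution, and it assumes the VAT number is always one letter followed by eight digits, as in B28125185. The confusion tables and the `repair_vat` helper are hypothetical names made up for illustration, so adjust them to the mix-ups you actually see.

```python
import re

# Assumed format: one uppercase letter followed by eight digits (e.g. B28125185).
VAT_PATTERN = re.compile(r"^[A-Z]\d{8}$")

# Hypothetical confusion tables: characters the OCR might swap.
# Extend or trim these based on the errors observed in your own data.
DIGIT_TO_LETTER = {"8": "B", "0": "O", "5": "S", "1": "I", "2": "Z", "6": "G"}
LETTER_TO_DIGIT = {"B": "8", "O": "0", "S": "5", "I": "1", "Z": "2", "G": "6"}


def repair_vat(token):
    """Return a corrected VAT number, or None if the token cannot be repaired."""
    token = token.strip().upper()
    if len(token) != 9:
        return None
    # The first position must be a letter: map look-alike digits to letters.
    first = DIGIT_TO_LETTER.get(token[0], token[0])
    # The remaining positions must be digits: map look-alike letters to digits.
    rest = "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in token[1:])
    candidate = first + rest
    return candidate if VAT_PATTERN.match(candidate) else None


print(repair_vat("828125185"))
print(repair_vat("B2812S18S"))
```

Because the expected position of letters and digits is known, each character is only corrected toward what is valid at its position, and a token that still does not match the pattern is rejected rather than guessed at.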

Upvotes: 1
