Alexandra

Reputation: 85

Tesseract misinterprets letters in an invoice

I am using Tesseract 4.0 to OCR some invoices. My problem is that some letters are recognized incorrectly; for example, I get a $ or an 8 where the letter is actually S.

The weird thing is that some S's are recognized correctly and some are not, and the same applies to other letters as well.

My question is, how can I train Tesseract to handle these cases better?

Also, I was wondering whether Tesseract misinterprets the S in S.A. as a number because of the dots.

I have attached the image that I am having problems with.

Thanks,

Alexandra

Upvotes: 0

Views: 693

Answers (2)

Dmitrii Z.

Reputation: 2357

What you should do is apply some preprocessing stages. Since your font is quite noisy, a simple erosion followed by a dilation gives a better input image:

erode(image, image, getStructuringElement(MORPH_RECT, Size(2, 4)));
dilate(image, image, getStructuringElement(MORPH_RECT, Size(4, 4)));

Output for that image is

S.C. Carpatcement Hording S.A.
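To make clearer what the erosion and dilation calls above do to the pixels, here is a minimal, dependency-free sketch of both operations on a grayscale grid. It is not OpenCV's implementation: the names `Gray`, `morph`, `erodeRect` and `dilateRect` are hypothetical, and the anchor and border handling are assumptions chosen to resemble OpenCV's defaults. Erosion takes the minimum over a rectangular neighbourhood and dilation takes the maximum, so for dark text on a light background, eroding first thickens the strokes and fills small bright gaps, and dilating afterwards thins them back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical image type: rows of 8-bit gray values (0 = black, 255 = white).
using Gray = std::vector<std::vector<std::uint8_t>>;

// Rectangular min filter (erode) or max filter (dilate) of size kw x kh.
// Assumptions: the kernel anchor sits at (kw / 2, kh / 2) and pixels
// outside the image are simply ignored.
static Gray morph(const Gray& src, int kw, int kh, bool takeMin) {
    assert(kw > 0 && kh > 0);
    const int rows = static_cast<int>(src.size());
    const int cols = rows ? static_cast<int>(src[0].size()) : 0;
    Gray dst(rows, std::vector<std::uint8_t>(cols, 0));
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            std::uint8_t acc = takeMin ? 255 : 0;
            for (int dy = -(kh / 2); dy < kh - kh / 2; ++dy) {
                for (int dx = -(kw / 2); dx < kw - kw / 2; ++dx) {
                    const int yy = y + dy;
                    const int xx = x + dx;
                    if (yy < 0 || yy >= rows || xx < 0 || xx >= cols) continue;
                    acc = takeMin ? std::min(acc, src[yy][xx])
                                  : std::max(acc, src[yy][xx]);
                }
            }
            dst[y][x] = acc;
        }
    }
    return dst;
}

// Erosion: each output pixel is the darkest pixel under the kernel.
Gray erodeRect(const Gray& src, int kw, int kh) { return morph(src, kw, kh, true); }

// Dilation: each output pixel is the brightest pixel under the kernel.
Gray dilateRect(const Gray& src, int kw, int kh) { return morph(src, kw, kh, false); }
```

With this sketch, `dilateRect(erodeRect(img, 3, 3), 3, 3)` removes an isolated bright pixel inside a dark stroke while leaving an isolated dark pixel on a white background in place, which is the kind of cleanup that helps with a noisy font.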

By the way, I noticed that if you use OEM_TESSERACT_ONLY (no LSTM), it gives correct results on the initial image as well as on the preprocessed one.

Upvotes: 0

Massimo Di Saggio

Reputation: 341

You can't really "train" tesseract. What you can do is tweak the contrast and/or brightness of the picture you pass to it in order to get better results. Tesseract also lets you specify the language of your text with the -l option; I couldn't really tell an improvement in accuracy from it, but your mileage may vary.
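As a rough illustration of the contrast and brightness tweak, the sketch below applies a linear transform to 8-bit gray values: each pixel v becomes alpha * v + beta, clamped to the 0..255 range. The function name `adjustContrastBrightness` and the flat pixel vector are hypothetical choices for this example, and suitable values of alpha and beta depend on the scan, so treat the numbers as something to experiment with.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: alpha scales contrast (1.0 = unchanged),
// beta shifts brightness (0.0 = unchanged). Results are rounded
// and clamped to the valid 8-bit range.
std::vector<std::uint8_t> adjustContrastBrightness(
        const std::vector<std::uint8_t>& pixels, double alpha, double beta) {
    assert(std::isfinite(alpha) && std::isfinite(beta));
    std::vector<std::uint8_t> out;
    out.reserve(pixels.size());
    for (std::uint8_t v : pixels) {
        const double scaled = alpha * static_cast<double>(v) + beta;
        const double clamped = std::max(0.0, std::min(255.0, scaled));
        out.push_back(static_cast<std::uint8_t>(std::lround(clamped)));
    }
    return out;
}
```

An alpha above 1 with a negative beta pushes dark text darker and light background lighter, which can help separate faint characters from the page before passing the image to Tesseract.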

Upvotes: 1
