Reputation: 85
I am using Tesseract 4.0 and I am trying to OCR some invoices. My problem is that it gives wrong results for some letters, for example I will get a $ or an 8 when the letter is actually S.
The weird thing is that some S's are recognized correctly and some are not, and the same applies to other letters as well.
My question is, how can I train Tesseract to handle these cases better?
Also, I was wondering whether Tesseract misinterprets the S in S.A. as a number because of the dots.
I have attached the image that I am having problems with.
Thanks,
Alexandra
Upvotes: 0
Views: 693
Reputation: 2357
What you should do is apply some preprocessing before running OCR.
Since the font in your image is quite noisy, a simple erosion followed by a dilation gives a better input image:
erode(image, image, getStructuringElement(MORPH_RECT, Size(2, 4)));
dilate(image, image, getStructuringElement(MORPH_RECT, Size(4, 4)));
The OCR output for that image is:
S.C. Carpatcement Hording S.A.
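To make it clearer what those two calls do, here is a minimal sketch of erosion and dilation written without OpenCV. It is not OpenCV's implementation: the `Gray` struct and the `morph`, `erodeGray` and `dilateGray` names are made up for illustration, and pixels outside the image are simply ignored. Erosion takes the minimum over a window and dilation takes the maximum, so with dark text on a light background erosion thickens the strokes and closes small gaps, and the dilation that follows thins them back.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical minimal grayscale image: row-major, 0 = black, 255 = white.
struct Gray {
    int rows;
    int cols;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int r, int c) const { return px[r * cols + c]; }
};

// Each output pixel becomes the minimum (takeMin == true) or maximum of the
// kw x kh window around it. Window positions outside the image are skipped.
static Gray morph(const Gray& src, int kw, int kh, bool takeMin) {
    Gray dst{src.rows, src.cols, std::vector<std::uint8_t>(src.px.size())};
    for (int r = 0; r < src.rows; ++r) {
        for (int c = 0; c < src.cols; ++c) {
            std::uint8_t best = takeMin ? 255 : 0;
            for (int dr = -(kh / 2); dr < kh - kh / 2; ++dr) {
                for (int dc = -(kw / 2); dc < kw - kw / 2; ++dc) {
                    const int rr = r + dr;
                    const int cc = c + dc;
                    if (rr < 0 || rr >= src.rows || cc < 0 || cc >= src.cols) {
                        continue;
                    }
                    const std::uint8_t v = src.at(rr, cc);
                    best = takeMin ? std::min(best, v) : std::max(best, v);
                }
            }
            dst.px[r * src.cols + c] = best;
        }
    }
    return dst;
}

// Erosion: local minimum, so dark regions grow.
Gray erodeGray(const Gray& src, int kw, int kh) { return morph(src, kw, kh, true); }

// Dilation: local maximum, so light regions grow.
Gray dilateGray(const Gray& src, int kw, int kh) { return morph(src, kw, kh, false); }
```

Calling `dilateGray(erodeGray(img, 2, 4), 4, 4)` mirrors the kernel sizes used above. For real images the OpenCV functions are the better choice, since they also handle anchors and border modes.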
By the way, I noticed that with OEM_TESSERACT_ONLY (the legacy engine, no LSTM) Tesseract gives the correct result on the original image as well as on the preprocessed one.
Upvotes: 0
Reputation: 341
You can't really "train" Tesseract. What you can do is tweak the contrast and/or brightness of the picture you pass in to get better results. Tesseract also lets you specify the language of your text with the -l option; I couldn't see any improvement in accuracy from it, but your mileage may vary.
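As a rough illustration of the contrast and brightness tweak, here is a minimal sketch that applies a linear transform `alpha * pixel + beta` to 8-bit grayscale pixels and saturates the result to 0..255. The function name and the parameter names `alpha` (contrast gain) and `beta` (brightness offset) are made up for this example, and suitable values have to be found by experiment for each image.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: scales each pixel by alpha (contrast) and shifts it
// by beta (brightness), clamping the result to the valid 8-bit range.
std::vector<std::uint8_t> adjustContrastBrightness(
        const std::vector<std::uint8_t>& pixels, double alpha, double beta) {
    std::vector<std::uint8_t> out;
    out.reserve(pixels.size());
    for (std::uint8_t p : pixels) {
        double v = alpha * static_cast<double>(p) + beta;
        v = std::max(0.0, std::min(255.0, v));  // saturate instead of wrapping
        out.push_back(static_cast<std::uint8_t>(std::lround(v)));
    }
    return out;
}
```

With `alpha` greater than 1 the gap between dark text and a light background widens, which is usually the aim before handing the image to an OCR engine.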
Upvotes: 1