gmaster
gmaster

Reputation: 692

Tesseract OCR gives bad output

I'm using a c# wrapper for the Tesseract library (3.02 if I'm not mistaken) (https://github.com/charlesw/tesseract). I've got it running and giving output, but that output is essentially garbage. Often it gives nothing and when it does give something it's often a mess. I know it's theoretically working because I've tried it on some really perfect images and it works. I'm wondering if someone can help me diagnose the issues and suggest some ways I can improve Tesseract accuracy. I've already converted all the images to black and white and the resolution is set at 300x300. I don't do any line straightening programmatically but as you can see below they're pretty straight.

enter image description here This image works perfectly

enter image description here This one does not work at all, producing either gibberish or nothing at all

I tried flipping the colors on some examples, thinking that it might give greater contrast (since most text is black on a white background, whereas the working ones were white text on black background). But:

enter image description here Does not work at all, whereas

enter image description here Again works perfectly.

I suspect this has something to do with the additional spacing between the letters in "INVOICE." But there must be some way to get decent results with a tighter font. Any suggestions are welcome, I'm a relative noob here.

Upvotes: 6

Views: 3786

Answers (2)

Fbi992
Fbi992

Reputation: 46

If possible you should consider using pictures with a higher resolution. The other problem about the Payments image is probably the gap between the letters that is too small. Tesseract cannot detect single letters if they are (almost) connected to the next letter of the word. I would suggest an image processing library like openCV to improve your results. You could try erosion/dilation. This will seperate the letters if the right parameters are used for the kernel. Use different kernels to see what works best for you.

Mat element = getStructuringElement(erosion_type,
                                   Size(2 * erosion_size + 1, 2 * erosion_size + 1),
                                   Point(erosion_size, erosion_size));

erode(src, erosion_dst, element);

What was helping me a lot when I was working on my project was using an adaptive threshold. I found this to be way more effective than just turning it into a grayscale or binary image. Note: Java Code, should be very similar in C though.

Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);

This is what I get after selecting one of your images in Pixtern, an android project of mine(source code on github). I was using a the adapting threshold but no dilation/erosion and the result is already quite good.

[broken links removed]

For the Payments image and similar ones: Try using a normal threshold and inverting the image(black font, white background). Again, dilation/erosion can be used afterwards. Java Code:

//results in binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
//Inverting image
Core.bitwise_not(cropedIm, cropedIm);

Upvotes: 3

Hooloovoo
Hooloovoo

Reputation: 96

Tesseract expects whole pages or rather it was trained on those. If you give it one or two characters or words it won't work well.

I assume you have more of these images. Stitch them together as lines of text: like each image is a line of text after the previous and it should work much better.


Furthermore, make sure you set the psm-parameter right when using tesseract. More on this: https://www.pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/

Upvotes: 2

Related Questions