Tesseract OCR gives bad output

Question

I'm using a C# wrapper for the Tesseract library (3.02, if I'm not mistaken): https://github.com/charlesw/tesseract. I've got it running and producing output, but that output is essentially garbage. Often it returns nothing, and when it does return something it's usually a mess. I know it works in principle because I've tried it on some really clean images and the results were correct. I'm wondering if someone can help me diagnose the issues and suggest ways to improve Tesseract's accuracy. I've already converted all the images to black and white, and the resolution is set to 300x300. I don't do any line straightening programmatically, but as you can see below the lines are pretty straight.

This image works perfectly

This one does not work at all, producing either gibberish or nothing at all

I tried flipping the colors on some examples, thinking that it might give greater contrast (since most text is black on a white background, whereas the working ones were white text on black background). But:

Does not work at all, whereas

Again works perfectly.

I suspect this has something to do with the additional spacing between the letters in "INVOICE." But there must be some way to get decent results with a tighter font. Any suggestions are welcome, I'm a relative noob here.

Fbi992 · Accepted Answer

If possible, you should consider using pictures with a higher resolution. The other problem with the Payments image is probably that the gaps between the letters are too small. Tesseract cannot detect single letters if they are (almost) connected to the next letter of the word. I would suggest an image processing library like OpenCV to improve your results. You could try erosion/dilation. This will separate the letters if the right parameters are used for the kernel. Keep in mind that in OpenCV erosion shrinks bright regions and dilation grows them, so which of the two opens up the gaps depends on whether your text is light on dark or dark on light. Use different kernels to see what works best for you.

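// C++ (OpenCV). erosion_type is one of MORPH_RECT, MORPH_CROSS or MORPH_ELLIPSE;
// erosion_size is the kernel radius in pixels.
Mat element = getStructuringElement(erosion_type,
                                   Size(2 * erosion_size + 1, 2 * erosion_size + 1),
                                   Point(erosion_size, erosion_size));

// Shrinks the bright regions of src and writes the result to erosion_dst.
erode(src, erosion_dst, element);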
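To make the effect of erosion more concrete, here is a minimal plain-Java sketch. It is not OpenCV's implementation: the class ErosionSketch and its erode method are hypothetical names, and ignoring pixels that fall outside the image is an assumption of this sketch. It simply takes the minimum over a square neighborhood, which is what erosion with a rectangular kernel amounts to on a grayscale image.

```java
import java.util.Arrays;

/** Illustrative only: a plain-Java min-filter erosion, not OpenCV's code. */
final class ErosionSketch {

    /**
     * Erodes an 8-bit grayscale image with a square kernel of side 2*size+1.
     * Each output pixel is the minimum of its neighborhood, so bright
     * regions shrink. Pixels outside the image are ignored (an assumption).
     */
    static int[][] erode(int[][] src, int size) {
        int rows = src.length, cols = src[0].length;
        int[][] dst = new int[rows][cols];
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                int min = 255;
                for (int dy = -size; dy <= size; dy++) {
                    for (int dx = -size; dx <= size; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < rows && xx >= 0 && xx < cols) {
                            min = Math.min(min, src[yy][xx]);
                        }
                    }
                }
                dst[y][x] = min;
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Two white 3x3 strokes joined by a one-pixel bridge, black background.
        int[][] img = {
            {0, 0, 0, 0, 0, 0, 0, 0, 0},
            {0, 255, 255, 255, 0, 255, 255, 255, 0},
            {0, 255, 255, 255, 255, 255, 255, 255, 0},
            {0, 255, 255, 255, 0, 255, 255, 255, 0},
            {0, 0, 0, 0, 0, 0, 0, 0, 0}
        };
        for (int[] row : erode(img, 1)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In this toy image the text is white on black, so erosion removes the thin bridge and leaves the two strokes separated (and thinner). For black text on a white background the same call would thicken the strokes instead, and dilation would be the operation that opens the gaps.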

What helped me a lot when I was working on my project was using an adaptive threshold. I found this to be far more effective than just converting the image to grayscale or applying a single global threshold. Note: this is Java code (block size 29, constant 10); it should be very similar in C, though.

Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);
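If it helps to see why this copes better with uneven lighting, here is a rough plain-Java sketch of the idea. Note that it swaps in a plain local mean where the call above uses a Gaussian-weighted one, so it is closer to the mean variant of adaptive thresholding. AdaptiveThresholdSketch and adaptiveThresholdMean are hypothetical names, and replicating edge pixels at the border is an assumption of this sketch.

```java
/** Illustrative only: mean-based adaptive threshold in plain Java. */
final class AdaptiveThresholdSketch {

    /**
     * A pixel becomes 255 if it is brighter than (local mean - c), else 0.
     * blockSize must be odd. Edge pixels are replicated at the border,
     * which is an assumption made for this sketch.
     */
    static int[][] adaptiveThresholdMean(int[][] src, int blockSize, double c) {
        int rows = src.length, cols = src[0].length, half = blockSize / 2;
        int[][] dst = new int[rows][cols];
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                long sum = 0;
                for (int dy = -half; dy <= half; dy++) {
                    for (int dx = -half; dx <= half; dx++) {
                        // Clamp coordinates so the window stays inside the image.
                        int yy = Math.min(Math.max(y + dy, 0), rows - 1);
                        int xx = Math.min(Math.max(x + dx, 0), cols - 1);
                        sum += src[yy][xx];
                    }
                }
                double mean = (double) sum / (blockSize * blockSize);
                dst[y][x] = src[y][x] > mean - c ? 255 : 0;
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // A dim region: background 60, one darker "ink" pixel of 20.
        int[][] dim = new int[5][5];
        for (int[] row : dim) java.util.Arrays.fill(row, 60);
        dim[2][2] = 20;
        int[][] out = adaptiveThresholdMean(dim, 3, 10);
        System.out.println("ink=" + out[2][2] + " background=" + out[1][1]);
    }
}
```

The point of the toy input is that every pixel is below 127, so a global threshold at 127 would turn the whole region black, while the local comparison still tells the ink pixel apart from its background.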

This is what I get after selecting one of your images in Pixtern, an Android project of mine (source code on GitHub). I was using the adaptive threshold but no dilation/erosion, and the result is already quite good.

[broken links removed]

For the Payments image and similar ones: try using a normal threshold and inverting the image (black font, white background). Again, dilation/erosion can be used afterwards. Java code:

// Results in a binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
// Invert the image
Core.bitwise_not(cropedIm, cropedIm);
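For reference, the two steps boil down to something like the following plain-Java sketch on an 8-bit image. ThresholdInvertSketch, thresholdBinary and invert are hypothetical names used only for illustration; this is not how OpenCV implements them internally.

```java
/** Illustrative only: global binary threshold followed by inversion. */
final class ThresholdInvertSketch {

    /** Values strictly above thresh become maxVal, all others become 0. */
    static int[][] thresholdBinary(int[][] src, int thresh, int maxVal) {
        int[][] dst = new int[src.length][src[0].length];
        for (int y = 0; y < src.length; y++) {
            for (int x = 0; x < src[0].length; x++) {
                dst[y][x] = src[y][x] > thresh ? maxVal : 0;
            }
        }
        return dst;
    }

    /** Inverts an 8-bit image: 0 becomes 255 and 255 becomes 0. */
    static int[][] invert(int[][] src) {
        int[][] dst = new int[src.length][src[0].length];
        for (int y = 0; y < src.length; y++) {
            for (int x = 0; x < src[0].length; x++) {
                dst[y][x] = 255 - src[y][x];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Light text (250) on a dark background (10), as in a white-on-black image.
        int[][] img = {{10, 250, 10}, {10, 250, 10}};
        int[][] result = invert(thresholdBinary(img, 127, 255));
        System.out.println(java.util.Arrays.deepToString(result));
    }
}
```

After both steps, light text on a dark background ends up as black text on a white background. Note that a pixel exactly at the threshold value (127 here) goes to 0, since the comparison is strict.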
