rmacqueen

Reputation: 1071

tesseract unable to detect characters in simple two-word image

I'm having trouble getting tesseract to recognize any characters in the following image:

enter image description here

When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought that the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) as well as augmenting the existing dictionary with these additional words (LAO and CAUD). Neither of those approaches worked. I have tried tesseract versions 3, 4, and have built version 5 from source on a Mac computer. All have given the same result.
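For reference, the dictionary-disabling attempt described above can be sketched roughly as follows. This is only an illustration: the file name `words.png` and both function names are hypothetical, and it assumes the `tesseract` binary is on the PATH. The `-c VAR=VALUE` option and the two `load_*_dawg` variables are the ones named above.

```python
import subprocess


def tesseract_no_dict_command(image_path):
    """Build a tesseract CLI call with both word lists switched off.

    Hypothetical helper for illustration; writes OCR output to stdout.
    """
    return [
        "tesseract", image_path, "stdout",
        "-c", "load_system_dawg=0",  # skip the main dictionary
        "-c", "load_freq_dawg=0",    # skip the frequent-words dictionary
    ]


def ocr_without_dictionaries(image_path):
    """Run the command and return whatever text tesseract printed."""
    result = subprocess.run(
        tesseract_no_dict_command(image_path),
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling `ocr_without_dictionaries("words.png")` on the first image still produced no text for me, which matches the "Empty page!!" result.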

Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:

enter image description here

The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?

Thanks in advance.

Upvotes: 1

Views: 2407

Answers (2)

rmacqueen

Reputation: 1071

The solution was to use the right page segmentation method (PSM). In my case, PSM 6, which is for a single block of text, did the trick.
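A minimal sketch of how that looks when calling the command-line tool from Python, assuming `tesseract` is installed and on the PATH. The file name `words.png` and the function names are placeholders of mine. As far as I know, version 4 and later spell the option `--psm`, while 3.x used `-psm`, so adjust `psm_flag` for older builds.

```python
import shutil
import subprocess


def build_psm_command(image_path, psm=6, psm_flag="--psm"):
    """Build a tesseract CLI call with an explicit page segmentation mode.

    PSM 6 tells tesseract to assume a single uniform block of text.
    """
    return ["tesseract", image_path, "stdout", psm_flag, str(psm)]


def ocr_with_psm(image_path, psm=6):
    """Run tesseract with the given PSM and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_psm_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()
```

For example, `ocr_with_psm("words.png")` runs the equivalent of `tesseract words.png stdout --psm 6`. If PSM 6 does not help on a different image, the other modes listed by `tesseract --help-psm` are worth trying the same way.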

Upvotes: 1

Guinther Kovalski

Reputation: 1929

It's a common problem. You will probably need to preprocess the image first, for example by rescaling it and applying filters.

Here are some references on how to do that:

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

https://docparser.com/blog/improve-ocr-accuracy/
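To make the idea concrete, here is a toy sketch of two such steps, upscaling and thresholding, written in plain Python on a nested list of grayscale values (0 to 255). The nested list is a stand-in I chose so the example has no dependencies. In practice you would load and save the image with an imaging library such as Pillow or OpenCV, and the function names, the scale factor, and the threshold of 128 are all assumptions to tune for your image.

```python
def upscale(pixels, factor):
    """Nearest-neighbour upscale of a 2D list of grayscale values."""
    out = []
    for row in pixels:
        # Repeat every pixel `factor` times horizontally...
        wide = [value for value in row for _ in range(factor)]
        # ...and repeat the widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(pixels, threshold=128):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


def preprocess(pixels, factor=2, threshold=128):
    """Upscale first, then binarize, as a simple cleanup before OCR."""
    return binarize(upscale(pixels, factor), threshold)
```

For instance, `preprocess([[10, 200]])` doubles the tiny one-row image in both directions and snaps each pixel to black or white. Real images usually benefit from smoother interpolation than nearest-neighbour, which is one more reason to use a proper imaging library for anything beyond a demonstration.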

Upvotes: 2
