How to improve Tesseract results when the text has a background image

Question

I am trying to extract numbers from images. I tested Tesseract OCR, but the results are not good enough. For example,

tesseract test.jpg stdout --psm 6

will output:

4367 42424W0 104

I assume the problem is the background image behind the text. Is there any way to improve the result?

thewaywewere · Accepted Answer

You can use ImageMagick's convert command to threshold the image to black and white. ImageMagick is free to download and supports multiple platforms.

By typing,

convert image.jpg -threshold 33% thresholded.jpg

you get a black-and-white version of the image. The 33% threshold value was found after a few attempts and adjustments.

Running the basic tesseract command on the thresholded image then gives the correct output.
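
The trial and error over threshold values can be scripted. The following is only a sketch, assuming ImageMagick's convert and tesseract are both on PATH. The function names sweep_thresholds and run, the DRY_RUN switch, and the thresholded_NN.png output names are hypothetical, chosen for this example. PNG is used here (my choice, not part of the original command) so that the thresholded image is not re-compressed as JPEG, and --psm 6 is carried over from the question.

```shell
#!/bin/sh
# Sketch: threshold an image at several percentages and OCR each result.
# Assumes ImageMagick's `convert` and `tesseract` are installed.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# sweep_thresholds IMAGE PCT [PCT ...]
sweep_thresholds() {
    sweep_input=$1
    shift
    for sweep_pct in "$@"; do
        sweep_out="thresholded_${sweep_pct}.png"
        # Binarize: pixels brighter than the threshold become white, the rest black.
        run convert "$sweep_input" -threshold "${sweep_pct}%" "$sweep_out"
        # OCR the binarized image, printing the recognized text to stdout.
        run tesseract "$sweep_out" stdout --psm 6
    done
}
```

For example, after loading the functions, sweep_thresholds test.jpg 25 33 40 writes three thresholded images and prints the OCR text for each, so the threshold that reads best can be picked by eye. On ImageMagick 7 installations the convert command may be provided as magick instead.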

If the image consists only of the digits 0-9, you can also restrict Tesseract to those characters to improve recognition accuracy: -c tessedit_char_whitelist=0123456789
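
Put together, the digit-only invocation might look like the sketch below. The wrapper name ocr_digits and the DRY_RUN switch are hypothetical; --psm 6 is taken from the command in the question, and thresholded.jpg is the output of the convert step above.

```shell
#!/bin/sh
# Sketch: OCR an image while restricting Tesseract to the digits 0-9.
# Assumes `tesseract` is installed. Set DRY_RUN=1 to print the command only.

# ocr_digits IMAGE
ocr_digits() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "tesseract $1 stdout --psm 6 -c tessedit_char_whitelist=0123456789"
    else
        # -c sets a config variable; the whitelist limits the characters Tesseract may output.
        tesseract "$1" stdout --psm 6 -c tessedit_char_whitelist=0123456789
    fi
}
```

Calling ocr_digits thresholded.jpg then runs the same recognition as before, but only digits can appear in the result. Whether the whitelist takes effect may depend on the Tesseract version and OCR engine in use, so it is worth comparing the output with and without it.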

Hope this helps.
