camino

Reputation: 10584

How to improve Tesseract results when the text has a background image

I am trying to extract numbers from images. I tested Tesseract OCR, but the result is not good enough. For example,

tesseract test.jpg stdout --psm 6

[image: test.jpg]

will output:

4367 42424W0 104

I assume the problem is caused by the background image behind the text. Is there any way to improve the result?

Upvotes: 1

Views: 768

Answers (1)

thewaywewere

Reputation: 8626

You can use the convert command of ImageMagick to threshold the image to black and white. ImageMagick is a free download and supports multiple platforms.

Run:

convert image.jpg -threshold 33% thresholded.jpg

This produces the image below. The threshold value was found after a few attempts and adjustments.
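The trial-and-error step can be scripted. Here is a minimal sketch, assuming ImageMagick's convert and tesseract are both on your PATH; the function name sweep_thresholds, the thresholded_<pct>.jpg file names and the DRY_RUN switch are made up for illustration and are not part of either tool.

```shell
#!/bin/sh
# Sweep several threshold percentages and show what tesseract reads for each.
# Assumption: `convert` (ImageMagick) and `tesseract` are on PATH.
# DRY_RUN=1 (a hypothetical switch for this sketch) only prints the commands.

sweep_thresholds() {
    input=$1
    shift
    for pct in "$@"; do
        out="thresholded_${pct}.jpg"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "convert $input -threshold ${pct}% $out"
            echo "tesseract $out stdout --psm 6"
        else
            # Binarize at this percentage, then OCR the result.
            convert "$input" -threshold "${pct}%" "$out"
            printf '%s%%: ' "$pct"
            tesseract "$out" stdout --psm 6
        fi
    done
}

# Example: sweep_thresholds image.jpg 25 30 33 40 50
```

Pick the percentage whose output looks right and keep that thresholded file. On ImageMagick 7 the same operation is normally run through the magick command instead of convert.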

[image: thresholded.jpg]

Then the basic tesseract command gives the correct output.

[image: tesseract output]

If the image only contains the digits 0-9, you can improve recognition accuracy by restricting tesseract to those characters with the option -c tessedit_char_whitelist=0123456789.
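Putting the two steps together, a rough sketch could look like this. It assumes convert and tesseract are on your PATH and reuses the 33% threshold and --psm 6 from this thread; the function name ocr_digits and the THRESHOLD and DRY_RUN variables are hypothetical helpers for this example only.

```shell
#!/bin/sh
# Threshold an image, then OCR it while restricting output to digits.
# Assumption: `convert` (ImageMagick) and `tesseract` are on PATH.
# THRESHOLD defaults to 33 (percent); DRY_RUN=1 only prints the commands.

ocr_digits() {
    input=$1
    threshold=${THRESHOLD:-33}
    bw="thresholded.jpg"
    whitelist="tessedit_char_whitelist=0123456789"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "convert $input -threshold ${threshold}% $bw"
        echo "tesseract $bw stdout --psm 6 -c $whitelist"
    else
        # Run OCR only if the thresholding step succeeded.
        convert "$input" -threshold "${threshold}%" "$bw" &&
            tesseract "$bw" stdout --psm 6 -c "$whitelist"
    fi
}

# Example: ocr_digits test.jpg
```

As far as I know, the whitelist is honoured by the legacy engine and by the LSTM engine from Tesseract 4.1 onward, so check your version if the option seems to have no effect.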

Hope this helps.

Upvotes: 3
