Reputation: 10584
I am trying to extract numbers from images. I tested Tesseract OCR, but the result is not good enough. For example,
tesseract test.jpg stdout --psm 6
will output:
4367 42424W0 104
I assume the problem is caused by the background image behind the digits. Is there any way to improve the result?
Upvotes: 1
Views: 768
Reputation: 8626
You can use ImageMagick's convert command to threshold the image to black and white. ImageMagick can be downloaded from its website, and it supports multiple platforms.
Running
convert image.jpg -threshold 33% thresholded.jpg
produces a black-and-white version of the image. I arrived at the 33% threshold after a few attempts and adjustments.
Then the basic tesseract command gives the correct output.
If the image contains only the digits 0-9, you can improve recognition accuracy by restricting Tesseract to those characters with the option -c tessedit_char_whitelist=0123456789.
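Since the threshold has to be found by trial and error, the steps above can be sketched as a small shell function that tries several values and prints what tesseract reads for each. This is only a rough sketch: the function name sweep_thresholds and the thresholded_N.png file names are made up for illustration, it assumes convert and tesseract are on your PATH (on ImageMagick 7 the command may be magick instead of convert), and it writes PNG rather than JPG so that JPEG compression does not add noise back to the thresholded image.

```shell
# Hypothetical helper: OCR one image at several threshold levels.
# Usage: sweep_thresholds IMAGE PERCENT [PERCENT...]
sweep_thresholds() {
  img=$1
  shift
  for t in "$@"; do
    # Illustrative output name, one file per threshold value.
    out="thresholded_${t}.png"
    # Turn every pixel pure black or pure white around the t% level.
    convert "$img" -threshold "${t}%" "$out" || return 1
    # Label the OCR result with the threshold that produced it.
    printf '%s%%: ' "$t"
    # Same page segmentation mode as in the question, digits only.
    tesseract "$out" stdout --psm 6 -c tessedit_char_whitelist=0123456789
  done
}
```

For example, sweep_thresholds image.jpg 25 33 40 50 prints one labelled line of OCR output per threshold, so you can pick the value that reads cleanly for your image.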
Hope this helps.
Upvotes: 3