Reputation: 33
I'm trying to use Tesseract OCR to extract a string of characters (not a valid word) from an image. The issue is that the characters in the image are spaced out, like in the picture below.
With the default settings, this image is recognized as:
5 O M E T E—E X fT
I tried tinkering with the page segmentation modes, but the closest I got was the following, with --psm 8:
"SOME TEXT.
I'm wondering whether there is a setting that lets Tesseract handle the spacing between the letters better, or whether I need to train a custom model.
Upvotes: 1
Views: 946
Reputation: 8005
The first approach is to resize the image.
If you scale the image by a factor of 0.15 along both axes, the default settings give:
S O M E T E X T
Code:
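import cv2
import pytesseract
# Load the image and convert it to grayscale.
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
# Shrink the image to 15% of its original size along both axes.
resized = cv2.resize(gray, (0, 0), fx=.15, fy=.15)
# Run Tesseract with its default settings.
text = pytesseract.image_to_string(resized)
print(text)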
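The factor 0.15 worked for this particular image; other images will likely need a different value. Here is a minimal sketch for trying several factors in one go. The names sweep_scales and make_tesseract_ocr are hypothetical helpers, not part of OpenCV or pytesseract, and the list of scales in the usage comment is only an assumption.

```python
from typing import Callable, Dict, Iterable


def sweep_scales(ocr_at_scale: Callable[[float], str],
                 scales: Iterable[float]) -> Dict[float, str]:
    """Run the given OCR callable once per scale factor and collect the results."""
    return {scale: ocr_at_scale(scale).strip() for scale in scales}


def make_tesseract_ocr(path: str) -> Callable[[float], str]:
    """Build an OCR callable for one image file (hypothetical helper)."""
    # Imported lazily so sweep_scales can be used without OpenCV installed.
    import cv2
    import pytesseract

    bgr = cv2.imread(path)
    if bgr is None:
        # cv2.imread returns None when the file cannot be read.
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    def ocr_at_scale(scale: float) -> str:
        # Same resize call as in the answer, with a variable factor.
        resized = cv2.resize(gray, (0, 0), fx=scale, fy=scale)
        return pytesseract.image_to_string(resized)

    return ocr_at_scale


# Example usage (assumed scale values):
# results = sweep_scales(make_tesseract_ocr("BdgJJ.png"), [0.1, 0.15, 0.25, 0.5])
# for scale, text in results.items():
#     print(scale, repr(text))
```

Printing with repr makes stray newlines and form feeds in the Tesseract output visible, which helps when comparing factors.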
The second approach is to apply an adaptive threshold.
After adaptive thresholding, page segmentation mode 6 (--psm 6) gives:
S O M E T E X T
Code:
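import cv2
import pytesseract
# Load the image and convert it to grayscale.
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
# Threshold each pixel against the mean of its 21x21 neighborhood minus 2.
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 21, 2)
# --psm 6 tells Tesseract to assume a single uniform block of text.
text = pytesseract.image_to_string(thresh, config="--psm 6")
# Keep only alphanumeric characters, join them with spaces, uppercase, then drop the last character of the result.
print(' '.join(ch for ch in text if ch.isalnum()).upper()[:-1])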
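The clean-up in the last line can be pulled into a small function so it is easier to test on its own. This is only a sketch: space_out_alnum is a hypothetical name, and unlike the one-liner above it does not slice off the final character, since that step looks specific to the output for this image.

```python
def space_out_alnum(text: str) -> str:
    """Keep only alphanumeric characters, uppercase them, and join with single spaces."""
    return " ".join(ch.upper() for ch in text if ch.isalnum())


# Whitespace, punctuation, and the form feed Tesseract often appends are dropped.
print(space_out_alnum("S O M E T E X T\n\x0c"))
print(space_out_alnum('"SOME TEXT.'))
```

Note that this only normalizes spacing and punctuation; it cannot repair misread characters such as a 5 in place of an S.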
Upvotes: 1