Tesseract OCR is inaccurate for images with letter spacing

I'm trying to use Tesseract OCR to extract a string of characters (not a valid word) from an image. The issue is that the characters in the image are spaced out, like in the picture below. With default properties, this image is recognized as 5 O M E T E—E X fT.

I tried to tinker with the page segmentation properties, but the closest I got is "SOME TEXT." with --psm 8. I'm wondering if there is a setting that will enable Tesseract to better deal with the spacing between the letters, or if I need to train a custom model.

Upvotes: 1

Answers (1)

Ahx

Reputation: 8005

The first approach is to resize the image.

If you shrink the image to 0.15 of its original size in each dimension (fx=0.15, fy=0.15), then with default properties you will get:

S O M E T E X T

Code:

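import cv2
import pytesseract

# Load the image and convert it to grayscale
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
# Shrink to 15% of the original width and height
resized = cv2.resize(gray, (0, 0), fx=.15, fy=.15)
# Run Tesseract with its default configuration
text = pytesseract.image_to_string(resized)
print(text)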
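
The factor 0.15 works for this particular image; other images may need a different value. Here is a minimal sketch for comparing several factors, assuming a grayscale image like the one in the code above. The function name ocr_at_scales, the default list of scales, and the ocr and resize parameters are made up for illustration and are not part of OpenCV or pytesseract:

```python
def ocr_at_scales(gray, scales=(0.15, 0.25, 0.5, 1.0), ocr=None, resize=None):
    """Return a dict mapping each scale factor to the text recognized at that scale.

    `ocr` and `resize` are hypothetical hooks; when omitted they fall back to
    pytesseract.image_to_string and cv2.resize, as used in the answer.
    """
    if resize is None:
        import cv2  # imported lazily so the helper can be tried without OpenCV

        def resize(img, factor):
            # Same call as in the answer: dsize=(0, 0) lets fx/fy set the size
            return cv2.resize(img, (0, 0), fx=factor, fy=factor)

    if ocr is None:
        import pytesseract

        ocr = pytesseract.image_to_string

    results = {}
    for factor in scales:
        # Strip surrounding whitespace so results are easy to compare
        results[factor] = ocr(resize(gray, factor)).strip()
    return results
```

Calling ocr_at_scales(gray) returns one recognized string per factor, so you can compare them side by side and pick the factor that reads best for your images.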

The second approach is to apply an adaptive threshold.

If you apply an adaptive threshold and use page segmentation mode 6 (--psm 6), the result will be:

S O M E T E X T

Code:

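import cv2
import pytesseract

# Load the image and convert it to grayscale
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
# Binarize with a mean adaptive threshold (block size 21, constant 2)
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 21, 2)
# --psm 6 treats the image as a single uniform block of text
text = pytesseract.image_to_string(thresh, config="--psm 6")
# Keep only alphanumeric characters, join them with spaces, upper-case, and drop the final character
print(' '.join(ch for ch in text if ch.isalnum()).upper()[:-1])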
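
The clean-up in the last line and the choice of --psm 6 are both tuned to this image. Here is a rough sketch of how the same idea could be generalized, assuming you want to compare a few page segmentation modes and normalize each result the same way. The names spaced_alnum and ocr_with_psm_modes and the ocr parameter are hypothetical. Modes 6, 7, 8 and 13 are Tesseract's "uniform block", "single text line", "single word" and "raw line" modes. Unlike the one-liner above, this helper does not drop the last character, since that step depends on what Tesseract returns for a given image:

```python
def spaced_alnum(text):
    """Keep only letters and digits, upper-case them, and join with single spaces."""
    return " ".join(ch.upper() for ch in text if ch.isalnum())


def ocr_with_psm_modes(image, modes=(6, 7, 8, 13), ocr=None):
    """Return a dict mapping each page segmentation mode to the cleaned-up text.

    `ocr` is a hypothetical hook; when omitted it falls back to
    pytesseract.image_to_string, called with config="--psm N".
    """
    if ocr is None:
        import pytesseract  # imported lazily so spaced_alnum works on its own

        ocr = pytesseract.image_to_string

    results = {}
    for mode in modes:
        raw = ocr(image, config=f"--psm {mode}")
        results[mode] = spaced_alnum(raw)
    return results
```

With the thresholded image from the code above, ocr_with_psm_modes(thresh) gives one normalized string per mode, which makes it easier to see which mode handles the letter spacing best.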

Upvotes: 1
