qanpi

Reputation: 33

Tesseract OCR is inaccurate for images with letter spacing

I'm trying to use Tesseract OCR to extract a string of characters (not a valid word) from an image. The issue is that the characters in the image are spaced out, like in the picture below.

enter image description here

With default properties, this image is recognized as 5 O M E T E—E X fT.

I tried to tinker with the page segmentation properties, but the closest I got is "SOME TEXT. with --psm 8. I'm wondering if there is a setting that will enable Tesseract to better deal with the spacing in between the letters, or if I need to train a custom model.

Upvotes: 1

Views: 946

Answers (1)

Ahx

Reputation: 8005

The first way is to resize the image.

If you scale the image down to 15% of its original size in both dimensions (fx=0.15, fy=0.15):

enter image description here

With default properties you will get:

S O M E T E X T

Code:

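import cv2
import pytesseract

# Load the image and convert it to grayscale.
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

# Shrink to 15% of the original size in both dimensions.
resized = cv2.resize(gray, (0, 0), fx=.15, fy=.15)

# Run Tesseract with its default settings.
text = pytesseract.image_to_string(resized)
print(text)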
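The factor 0.15 works for this particular image; for other images a different factor may be needed. As a rough sketch only, the following tries several scale factors and reports the reading that most of them agree on. The names `SCALES`, `read_at_scales`, and `most_common_reading` are hypothetical, the candidate factors other than 0.15 are arbitrary, and using agreement across scales as a proxy for correctness is an assumption, not something Tesseract guarantees.

```python
from collections import Counter

# Hypothetical candidate factors; only 0.15 comes from the example above.
SCALES = (0.15, 0.25, 0.5, 0.75, 1.0)


def read_at_scales(path, scales=SCALES):
    """Run Tesseract on the image at each scale factor; return {scale: text}."""
    # Imported here so most_common_reading() can be used without OpenCV installed.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(path)

    results = {}
    for scale in scales:
        # INTER_AREA is generally a good choice when shrinking an image.
        resized = cv2.resize(gray, (0, 0), fx=scale, fy=scale,
                             interpolation=cv2.INTER_AREA)
        results[scale] = pytesseract.image_to_string(resized).strip()
    return results


def most_common_reading(results):
    """Return the text that the largest number of scale factors produced."""
    return Counter(results.values()).most_common(1)[0][0]
```

Usage would look like `print(most_common_reading(read_at_scales("BdgJJ.png")))`. Printing the whole dictionary returned by `read_at_scales` is also useful for seeing which factors break down.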

The second way is to use adaptive thresholding.

If you apply an adaptive threshold:

enter image description here

With page segmentation mode 6 (--psm 6), the result will be:

S O M E T E X T 

Code:

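import cv2
import pytesseract

# Load the image and convert it to grayscale.
bgr = cv2.imread("BdgJJ.png")
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

# Binarize with a mean adaptive threshold (block size 21, constant 2).
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 21, 2)

# --psm 6 treats the image as a single uniform block of text.
text = pytesseract.image_to_string(thresh, config="--psm 6")

# Keep only alphanumeric characters, separate them with spaces, upper-case,
# and drop the final character of the joined string.
print(' '.join(ch for ch in text if ch.isalnum()).upper()[:-1])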
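Note that the `[:-1]` in the last line always removes the final character of the joined string, which is presumably tuned to what Tesseract returns for this one image. If you want the same clean-up without that hard-coded slice, a minimal sketch could look like the following (the function name `spaced_letters` is made up for illustration):

```python
def spaced_letters(text):
    """Keep only letters and digits, upper-case them, and join with single spaces."""
    return " ".join(ch.upper() for ch in text if ch.isalnum())
```

You would then call `print(spaced_letters(text))` on the OCR output. Because whitespace and punctuation are filtered out before joining, there is no trailing separator to trim, but any spurious letter or digit that Tesseract emits will still show up in the result.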

Upvotes: 1
