Reputation: 187
Pytesseract fails to recognize the digits 6 and 8. It recognizes 6 as 5 and 5 as 5, 3 as 8 and 8 as 8, Oct as 0c: or 0::, and Wed as Men. The script used:
config= "-c tessedit_char_whitelist=01234567890.:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz -psm 3 -oem 0"
text = pytesseract.image_to_string(image, config=config)
I also tried every psm value from 1 to 12, with no luck. Increasing the contrast leaves even more numbers unrecognized:
kernel = np.ones((2, 2), np.uint8)
dilation = cv2.dilate(im, kernel)  # iterations defaults to 1
text = pytesseract.image_to_string(dilation, config=config)
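For reference, the sweep over psm values described above could be scripted roughly as follows. This is only a sketch: `build_config` and `sweep_psm` are hypothetical helper names, not part of pytesseract, and it assumes Tesseract 4 or later, where the flags are spelled `--psm` and `--oem` with two dashes. Modes that the installed Tesseract rejects are recorded as `None` instead of aborting the sweep.

```python
def build_config(psm, oem=None, whitelist=None):
    """Build a Tesseract config string (hypothetical helper, double-dash flags)."""
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def sweep_psm(image, psm_values=range(1, 13), oem=None, whitelist=None):
    """Run OCR once per page segmentation mode and collect the raw text."""
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    results = {}
    for psm in psm_values:
        config = build_config(psm, oem=oem, whitelist=whitelist)
        try:
            results[psm] = pytesseract.image_to_string(image, config=config)
        except pytesseract.TesseractError:
            # Some modes may be unsupported by the installed Tesseract.
            results[psm] = None
    return results
```

Comparing the entries of the returned dictionary side by side makes it easier to see whether any mode handles the problem digits differently.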
Raw data:
After running the script:
After running new script:
Upvotes: 2
Views: 1858
Reputation: 46660
Preprocessing to clean and smooth the image before passing it to Pytesseract can help. Specifically, morphological operations that close small holes and remove noise can enhance the image. Applying a sharpening filter, or adjusting the kernel size or type, may also help. I believe --psm 6 is the best choice here, since the image is a single uniform block of text. Here's what I get after a simple morph close:
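import cv2
import pytesseract

# Point pytesseract at the Tesseract binary (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load as grayscale, then threshold so dark text becomes white on black
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]

# Morphological close fills small gaps in the character strokes
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# Invert back to dark text on a white background for OCR
result = 255 - close

# --psm 6: treat the image as a single uniform block of text
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.imshow('close', close)
cv2.waitKey()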
Upvotes: 2