Reputation: 5177
I am trying to write Python code that replicates my manual image preprocessing and recognition workflow with Tesseract-OCR.
Manual process:
To manually recognize text from a single image, I preprocess the image using GIMP and save it as a TIF. Then I feed it to Tesseract-OCR, which recognizes it correctly.
To preprocess the image using GIMP I do -
Then I feed it to tesseract -
$ tesseract captcha.tif output -psm 6
And I get an accurate result all the time.
Python Code:
I have tried to replicate the above procedure using OpenCV and Tesseract -
import cv2

def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    # cv2.IMREAD_GRAYSCALE replaces the removed cv2.CV_LOAD_IMAGE_GRAYSCALE (OpenCV 3+)
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
    (thresh, im_bw) = cv2.threshold(im_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # although thresh is used below, gonna pick something suitable
    im_bw = cv2.threshold(im_gray, thresh, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite(binary_image_path, im_bw)
    return binary_image_path
from PIL import Image

def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)
    im_bin = Image.open(bin_image_path)

    basewidth = 300  # in pixels
    wpercent = (basewidth / float(im_bin.size[0]))
    hsize = int((float(im_bin.size[1]) * float(wpercent)))
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)

    # tesseract-ocr only works with TIF so save the bigger image in that format
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)

    return tif_file
from pytesseract import image_to_string

def get_captcha_text_from_captcha_image(captcha_path):
    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

    # Perform OCR using the tesseract-ocr library
    image = Image.open(tif_file)
    ocr_text = image_to_string(image, config="-psm 6")
    alphanumeric_text = ''.join(e for e in ocr_text if e.isalnum())
    return alphanumeric_text
But I am not getting the same accuracy. What did I miss?
This code is available at https://github.com/hussaintamboli/python-image-to-text
Upvotes: 7
Views: 10987
Reputation: 7985
You have already applied simple thresholding. The missing part is that you need to crop the image and read the digits one by one.
For each single digit, upsampling is required for accurate recognition, and adding a border around the crop will center the digit.
Code:
import cv2
import pytesseract

img = cv2.imread('Iv5BS.jpg')
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.threshold(gry, 128, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
(h_thr, w_thr) = thr.shape[:2]

s_idx = 2
e_idx = int(w_thr / 6) - 20
result = ""

for _ in range(0, 6):
    # Crop the current digit, upsample it, and pad it with a white border
    crp = thr[5:int((6 * h_thr) / 7), s_idx:e_idx]
    (h_crp, w_crp) = crp.shape[:2]
    crp = cv2.resize(crp, (w_crp * 2, h_crp * 2))
    crp = cv2.copyMakeBorder(crp, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)
    s_idx = e_idx
    e_idx = s_idx + int(w_thr / 6) - 7
    txt = pytesseract.image_to_string(crp, config="--psm 6")
    result += txt[0]
    cv2.imshow("crp", crp)
    cv2.waitKey(0)

print(result)
88BC7F
Upvotes: 2
Reputation: 11
If the output deviates only minimally from your expected output (i.e. extra ' or " characters, as suggested in your comments), try limiting character recognition to the character set you expect (e.g. alphanumeric).
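A minimal sketch of both approaches: Tesseract's `tessedit_char_whitelist` config option restricts recognition at OCR time (note that some Tesseract 4.0 builds ignore the whitelist with the default LSTM engine), and a post-OCR filter is a simple fallback. The `strip_non_alphanumeric` helper name is hypothetical, not from the question's repo.

```python
# Whitelist config for pytesseract, e.g.:
#   pytesseract.image_to_string(img, config=WHITELIST_CONFIG)
WHITELIST_CONFIG = "--psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def strip_non_alphanumeric(ocr_text):
    """Post-OCR fallback: drop any stray punctuation Tesseract emits."""
    return ''.join(ch for ch in ocr_text if ch.isalnum())

print(strip_non_alphanumeric("8'8B,\"C7F\n"))  # -> 88BC7F
```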
Upvotes: 1