Pytesseract not working sometimes on perfectly clear Images

Question

I have very high resolution Engineering drawings/Circuit diagrams which contain text in a number of different regions. The aim is to extract text from such Images.

I am using pytesseract for this task. Applying pytesseract directly is not possible as in that case text from different regions gets jumbled up in the output. So I am identifying different bounding boxes containing the text and then iteratively passing these regions to pytesseract. The bounding box logic is working fine however sometimes I get no text from the cropped Image or partial text only. I would understand if the cropped images were low res or blurred but this is not the case. Please have a look at the attached couple of examples.

Image 1

Image 2

Here is my code to get the text:

source_img_simple = cv2.imread('image_name.tif')
source_img_simple_gray = cv2.cvtColor(source_img_simple, cv2.COLOR_BGR2GRAY)
img_text = pytesseract.image_to_string(source_img_simple_gray)

# Export the text file
with open('Output_OCR.txt', 'w') as text:
    text.write(img_text)

Actual result for first image- No output (Blank text file) For 2nd Image- Partial text (ALL MISCELLANEOUS PIPING AND CONNECTION SIZES) I'm trying to know how to improve the quality of the OCR. I'm open to using any other tools as well (apart from pytesseract) if required. But can't use API's (Google, AWS etc.) as that is a restriction. Note: I have gone through the below post and it is not duplicate of my case as I have black text on white background:

Pytesseract dont reconize a very clear image

nathancy · Accepted Answer

Since your images already look clean, no preprocessing is needed. A simple approach is to threshold and Gaussian blur to smooth the image before throwing it into Pytesseract. Here's the results after simple processing and the output from Pytesseract

SYSTEM CODE IS 3CAB, EXCEPT AS INDICATED.

For the 2nd image

ALL MISCELLANEOUS PIPING AND CONNECTION SIZES SHALL BE 1 INCH. EXCEPT AS INDICATED.

We use the --psm 6 config flag since we want to treat the image as a single uniform block of text. Here's some additional configuration flags that could be useful

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR	esseract.exe"

image = cv2.imread('2.jpg',0)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]

result = cv2.GaussianBlur(thresh, (5,5), 0)
result = 255 - result

data = pytesseract.image_to_string(result, lang='eng',config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()

Pytesseract not working sometimes on perfectly clear Images

Answers (1)

Related Questions