Reputation: 11907
I have a binary image like this,
and I want to extract the numbers in it using Tesseract OCR in Python. I used pytesseract
on the image like this:
txt = pytesseract.image_to_string(img)
But I am not getting good results.
What pre-processing or augmentation could help Tesseract do better?
I tried to localize the text in the image using the EAST text detector,
but it was not able to recognize the text.
How should I proceed with this in Python?
Upvotes: 2
Views: 2012
Reputation: 7985
I think the page segmentation mode is an important factor here.
Since we are trying to read column values, we could use --psm 4 (source).
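If it is not obvious that mode 4 is the best fit for a given image, one option is to run the OCR once per candidate mode and compare the raw text. The following is only a sketch: the helper name `ocr_with_modes`, the default mode list and the injectable `ocr` argument are my own assumptions, not part of pytesseract. By default it calls `pytesseract.image_to_string` with a `--psm N` config string.

```python
def ocr_with_modes(image, modes=(4, 6, 11), ocr=None):
    """Run OCR once per page segmentation mode and return {mode: raw_text}.

    `ocr` is a hypothetical hook: any callable accepting (image, config=...).
    When omitted, pytesseract.image_to_string is used.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {psm: ocr(image, config=f"--psm {psm}") for psm in modes}
```

Calling `ocr_with_modes(gry)` on the grayscale image and printing each entry makes it easy to see which mode keeps the column layout intact.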
import cv2
import pytesseract
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
txt = pytesseract.image_to_string(gry, config="--psm 4")
We want to get the text that starts with #:
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])
Result:
['#3', '#7', '#9', '#€']
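The filtering step above keeps the first two characters of every line that contains a #, which assumes the # sits at the start of the line. A slightly more defensive variant, as a sketch only, is to search for the pattern itself. The function name `extract_labels` and the `digits_only` switch are assumptions of mine, and unlike the slicing approach this returns every match, even several per line.

```python
import re


def extract_labels(ocr_text, digits_only=False):
    """Return the sorted '#x' tokens found anywhere in the OCR output.

    With digits_only=True only '#' followed by a digit is kept, which
    drops misreads where the character after '#' is not a number.
    """
    pattern = r"#\d" if digits_only else r"#\S"
    return sorted(re.findall(pattern, ocr_text))
```

Passing `digits_only=True` is a quick way to discard entries where Tesseract read something other than a digit after the #.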
But we miss #4 and #5. We could apply adaptive thresholding:
Result:
['#3', '#4', '#5', '#7', '#9', '#€']
Unfortunately, #2 and #6 are not recognized.
Code:
import cv2
import pytesseract
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
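# Inverted adaptive mean threshold: pixels well below the local mean become foreground
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, blockSize=131, C=100)
# Invert again so the text is dark on a light background for Tesseract
bnt = cv2.bitwise_not(thr)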
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])
print(txt)
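To get a feel for what blockSize and C do in the adaptiveThreshold call above, here is a tiny pure-Python sketch of mean adaptive thresholding with inverted output. It is not OpenCV's implementation: the function name is made up, edge pixels simply reuse the nearest valid pixel, and rounding may differ from cv2. The idea is that each pixel is compared with the mean of its block_size x block_size neighbourhood minus c, and pixels at or below that value become foreground.

```python
def adaptive_mean_threshold_inv(gray, max_value, block_size, c):
    """Toy mean adaptive threshold (inverted) on a list of rows of ints."""
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse the nearest valid pixel
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_threshold = total / (block_size * block_size) - c
            # Inverted output: pixels at or below the local threshold become foreground
            row.append(max_value if gray[y][x] <= local_threshold else 0)
        out.append(row)
    return out
```

On a small patch such as `[[200, 200, 200], [200, 50, 200], [200, 200, 200]]` with `block_size=3` and `c=10`, only the dark centre pixel falls far enough below its local mean to be marked. A large C like the 100 used above means a pixel has to be much darker than its surroundings to count as text.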
Upvotes: 2