Reputation: 169
I am using Tesseract OCR to extract text from an image file.
Below is the sample text I got from my image:
Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O
How can I extract Certificate No. from this? Any hint or solution will help me here.
Upvotes: 1
Views: 5439
Reputation: 46670
Before passing the image to Tesseract OCR, it's important to preprocess it to remove noise and smooth the text. Here's a simple approach using OpenCV.
After converting to grayscale, we apply Otsu's threshold to get a binary image.
From here, we give it a slight blur and invert the image to get our result.
Results from Pytesseract
Certificate No. : IN-KA047969602415880
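import cv2
import pytesseract
# Windows-specific path to the Tesseract executable; adjust for your install
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load the image directly as grayscale (flag 0)
image = cv2.imread('1.png',0)
# Otsu's threshold, inverted: text becomes white on a black background
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
# Slight blur to smooth the text, then invert back to dark text on white
blur = cv2.GaussianBlur(thresh, (3,3), 0)
result = 255 - blur
# --psm 6: treat the image as a single uniform block of text
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()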
Upvotes: 1
Reputation: 1528
If the certificate number always has the structure shown here (2 letters, a hyphen, then 17 more characters, which in this sample are 2 letters followed by 15 digits), you can use a regex:
# third-party regex package; the standard-library re module works the same way here
import regex as re
# I took the entire sequence originally, but this is just an example
sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
re.search('[A-Z]{2}-.{17}', sequence).group()
#'IN-KA047969602415880'
.search() looks for the pattern you specify, and .group() returns the text of the first match (in this case there would be only one). You can search for anything like this in a given string; I suggest a review of regex.
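As a slightly stricter variant, here is a minimal sketch that runs against the full OCR text from the question. It assumes the certificate number is always 2 letters, a hyphen, 2 letters, and 15 digits, which is inferred from this one sample only and may not hold for other certificates. The names CERT_PATTERN and extract_certificate_no are made up for illustration. It uses the standard-library re module and returns None instead of raising when nothing matches.

```python
import re

# Assumed layout (inferred from the single sample in the question):
# 2 letters, hyphen, 2 letters, 15 digits, delimited by word boundaries.
CERT_PATTERN = re.compile(r"\b[A-Z]{2}-[A-Z]{2}\d{15}\b")


def extract_certificate_no(text):
    """Return the first token matching the assumed layout, or None."""
    match = CERT_PATTERN.search(text)
    return match.group() if match else None


# Full OCR output as posted in the question (OCR typos left untouched)
ocr_text = (
    "Certificate No. Certificate Issued Date Acoount Reference "
    "Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM "
    "NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA "
    "SUBIN-KAKAKSFCL0858710154264833O"
)

print(extract_certificate_no(ocr_text))
```

Requiring digits after the second letter pair keeps the pattern from matching other hyphenated tokens in the text, such as KA-BA or the SUBIN-... reference. If OCR misreads a digit as a letter (for example 0 as O), this stricter pattern will miss the number, so the looser .{17} version may be the better fallback.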
Upvotes: 1