Reputation: 169

Extract specific contents from text using python and Tesseract OCR

I am using Tesseract OCR to extract text from an image file.

Below is the sample text I got from my image:

Certificate No. Certificate Issued Date Acoount Reference Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O

How can I extract the Certificate No. from this? Any hint or solution would help.

Upvotes: 1

Answers (2)

nathancy

Reputation: 46670

Before passing the image to Tesseract OCR, it's important to preprocess it to remove noise and smooth the text. Here's a simple approach using OpenCV:

Convert image to grayscale
Otsu's threshold to obtain binary image
Gaussian blur and invert image

After converting to grayscale, we apply Otsu's threshold to get a binary image.

From here we give it a slight blur and invert the image to get our result.

Results from Pytesseract

Certificate No. : IN-KA047969602415880

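import cv2
import pytesseract

# Windows only: path to the Tesseract executable (adjust or remove on other systems)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image directly as grayscale
image = cv2.imread('1.png', cv2.IMREAD_GRAYSCALE)
# Otsu's threshold, inverted so the text is white on a black background
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]

# Slight blur to smooth the text, then invert back to black text on white
blur = cv2.GaussianBlur(thresh, (3,3), 0)
result = 255 - blur

# --psm 6: treat the image as a single uniform block of text
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

# Show the intermediate and final images; press any key to close
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()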
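Once the OCR output has the label and value on the same line, as in the result above, the number can be pulled out of the string. This is only a minimal sketch: the helper name extract_certificate_no is hypothetical, and it assumes the value directly follows the "Certificate No." label (optionally after a colon) and contains no spaces.

```python
import re

def extract_certificate_no(ocr_text):
    """Return the token that follows a 'Certificate No.' label, or None.

    Assumption: label and value sit on the same line, as in the
    Pytesseract result shown above.
    """
    match = re.search(r"Certificate\s+No\.?\s*:?\s*(\S+)", ocr_text)
    return match.group(1) if match else None

# OCR line taken from the result above
ocr_line = "Certificate No. : IN-KA047969602415880"
print(extract_certificate_no(ocr_line))
```

In the script above you would pass data instead of ocr_line. It will not work on the flattened text from the question, where the label is followed by the other column headers rather than by the value.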

Upvotes: 1

Ronny Efronny

Reputation: 1528

If the certificate number always has the structure given here (2 uppercase letters, a hyphen, then 17 characters), you can use a regex:

import re

# I took the entire sequence originally, but this is just an example
sequence = 'Reference IN-KA047969602415880 18-Feb-2016 01:39'
re.search('[A-Z]{2}-.{17}', sequence).group()
#'IN-KA047969602415880'

.search scans the string for the first place the pattern matches, and .group() returns the matched text (in this case there would be only one match). You can search for anything like this in a given string; I suggest a review of regex.
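If you run the pattern over the whole OCR string, a stricter version may help avoid accidental matches in the other fields. The sketch below is an assumption based on the single sample in the question: it guesses the format is two uppercase letters, a hyphen, two more uppercase letters, and 15 digits. The function name find_certificate_numbers is hypothetical.

```python
import re

# Assumed format (inferred from one sample only): e.g. IN-KA047969602415880
CERT_PATTERN = re.compile(r"\b[A-Z]{2}-[A-Z]{2}\d{15}\b")

def find_certificate_numbers(text):
    """Return every substring of text that matches the assumed format."""
    return CERT_PATTERN.findall(text)

# Full OCR text from the question, unchanged
ocr_text = (
    "Certificate No. Certificate Issued Date Acoount Reference "
    "Unique Doc. Reference IN-KA047969602415880 18-Feb-2016 01:39 PM "
    "NONACC(FI)/kakfscI08/BTM LAYOUT/KA-BA SUBIN-KAKAKSFCL0858710154264833O"
)
print(find_certificate_numbers(ocr_text))
```

The word boundaries keep the pattern from matching inside longer tokens such as the SUBIN-... reference. If real certificates use a different number of digits, adjust the {15} accordingly.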

Upvotes: 1
