Reputation: 1
I want to extract only the text and remove everything else from the following image:
Now, I want to remove everything except the text inside the rectangle shapes. Here's my code:
import cv2
import pytesseract
import numpy as np
from imutils.perspective import four_point_transform
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load image, convert to HSV, color threshold to get mask
image = cv2.imread('1.png')
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
lower = np.array([0, 0, 0])
upper = np.array([100, 175, 110])
mask = cv2.inRange(hsv, lower, upper)
# Morph close to connect individual text into a single contour
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel, iterations=3)
# Find rotated bounding box then perspective transform
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rect = cv2.minAreaRect(cnts[0])
box = cv2.boxPoints(rect)
box = np.intp(box)  # np.int0 is an alias of np.intp and was removed in NumPy 1.24
cv2.drawContours(image,[box],0,(36,255,12),2)
warped = four_point_transform(255 - mask, box.reshape(4, 2))
# OCR
data = pytesseract.image_to_string(warped, lang='eng', config='--psm 6')
print(data)
cv2.imshow('mask', mask)
cv2.imshow('close', close)
cv2.imshow('warped', warped)
cv2.imshow('image', image)
cv2.waitKey()
Here's the output of the code:
The problem with my code is that it masks out everything in the image; I only want to extract the text, not the other elements:
Upvotes: 0
Views: 451
Reputation: 18895
Since you have "perfect" rectangles in your image, I came up with the following approach:
Grayscale and inverse binarize the input image to get rid of possible artifacts, and to have white boxes and text on a black background.
In the following, template matching will be used to find the upper left corners of the boxes of interest. So, set up a template and mask mimicking those upper left corners.
The template itself resembles a "corner" 50 pixels wide and 20 pixels high, since all boxes of interest have at least these dimensions:
The corresponding mask limits the template to a 5 pixel wide "stripe" along the "corner":
Since all text keeps a margin of at least 5 pixels from the boxes' borders, there will be "perfect" matching results, because no text interferes with the matching.
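The construction of that template and mask can be sketched with plain NumPy (a scaled-down 4×8 version with a 2 pixel stripe here, purely for illustration; the full code below uses 20×50 with a 5 pixel stripe):

```python
import numpy as np

# Scaled-down illustration of the corner template and its mask:
# a white top row + left column on black, and a mask keeping only
# a thin stripe along that corner.
h, w, stripe = 4, 8, 2  # the real code uses 20 x 50 and a 5 px stripe

templ = np.full((h, w), 255, dtype=np.uint8)
templ[1:, 1:] = 0           # only the top row and left column stay white

mask = np.full_like(templ, 255)
mask[stripe:, stripe:] = 0  # only a stripe-wide band along the corner stays

print(templ)
print(mask)
```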
From the "perfect" matching results, the (x, y) coordinates of each box of interest are derived and iterated.
The box is flood filled with some gray color (there's only black and white in the image, due to the binarization in the beginning), and then masked using that gray color:
From that, the bounding rectangle is determined, and that portion is copy-pasted from the original to some clean image. Also, pytesseract is executed on the content.
Here's the full code:
import cv2
import numpy as np
import pytesseract
# Read image as grayscale
img = cv2.imread('M7X8C.png', cv2.IMREAD_GRAYSCALE)
# Inverse binarize image to get rid of possible artifacts, and to have
# white boxes and text on black background
thr = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)[1]
# Set up a template and mask mimicking the upper left corner of the
# boxes of interest
templ = np.full((20, 50), 255, dtype=np.uint8)
templ[1:, 1:] = 0
mask = np.full_like(templ, 255)
mask[5:, 5:] = 0
# Template matching
res = cv2.matchTemplate(thr, templ, cv2.TM_CCORR_NORMED, mask=mask)
# Extract upper left corners of the boxes of interest
boxes_tl = np.argwhere(res == 1)
# Initialize new clean image
clean = np.full_like(img, 255)
# For each upper left corner...
for i in np.arange(boxes_tl.shape[0]):
    # Get coordinates of upper left corner
    y, x = boxes_tl[i, :]
    print('x: {}, y: {}'.format(x, y))
    # Flood fill inner part of box, and mask that area
    box_mask = cv2.floodFill(thr.copy(), None, (x + 1, y + 1), 128)[1] == 128
    # Extract the bounding rectangle of that area
    x, y, w, h = cv2.boundingRect(box_mask.astype(np.uint8))
    # Copy box content to clean image
    clean[y:y+h, x:x+w] = img[y:y+h, x:x+w]
    # Run pytesseract on box content
    text = pytesseract.image_to_string(thr[y:y+h, x:x+w], config='--psm 6')
    print(text.replace('\f', ''))
# Output
cv2.imshow('clean', clean)
cv2.waitKey(0)
That's the clean image:
And here are the first two pytesseract results:
x: 1, y: 0
PGGEOS KKCI 100600
x: 199, y: 39
ISOL
EMBD
CB
400
XXX
As you can see, the results are not perfect (S instead of 5), most likely due to the monospace font. Getting (or generating) some Tesseract traineddata for that kind of font will surely help to overcome that issue.
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.1
PyCharm: 2021.1.1
NumPy: 1.19.5
OpenCV: 4.5.2
pytesseract: 5.0.0-alpha.20201127
----------------------------------------
Upvotes: 1