Hasaankhattak

Reputation: 1

How to remove background lines and shapes from an image for text extraction?

I want to extract only the text from the following image and remove everything else:

Input image

I want to remove everything except the text inside the rectangular boxes. Here's my code:

import cv2
import pytesseract
import numpy as np
from imutils.perspective import four_point_transform

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, convert to HSV, color threshold to get mask
image = cv2.imread('1.png')
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
lower = np.array([0, 0, 0])
upper = np.array([100, 175, 110])
mask = cv2.inRange(hsv, lower, upper)

# Morph close to connect individual text into a single contour
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find rotated bounding box then perspective transform
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rect = cv2.minAreaRect(cnts[0])
box = cv2.boxPoints(rect)
box = box.astype(np.intp)  # same type as the old np.int0 alias, which NumPy 2.0 removed
cv2.drawContours(image,[box],0,(36,255,12),2)
warped = four_point_transform(255 - mask, box.reshape(4, 2))

# OCR
data = pytesseract.image_to_string(warped, lang='eng', config='--psm 6')
print(data)

cv2.imshow('mask', mask)
cv2.imshow('close', close)
cv2.imshow('warped', warped)
cv2.imshow('image', image)
cv2.waitKey()

Here's the output of the code:

Output

The problem with my code is that it shades everything in the image, while I only want to extract the text and nothing else:

Output

Upvotes: 0

Views: 451

Answers (1)

HansHirse

Reputation: 18895

Since you have "perfect" rectangles in your image, I came up with the following approach:

  1. Grayscale and inverse binarize the input image to get rid of possible artifacts, and to have white boxes and text on black background.

  2. In the following, template matching will be used to find the upper left corners of the boxes of interest. So, set up a template and mask mimicking those upper left corners.

    The template itself resembles a "corner" of 50 pixels length and 20 pixels height, since all boxes of interest at least have these dimensions:

    Template

    The corresponding mask limits the template to a 5 pixels wide "stripe" along the "corner":

    Template mask

    Since all texts keep a margin of at least 5 pixels from the boxes' borders, no text interferes with the matching, so the corners of the boxes of interest yield "perfect" matching results.

  3. From the "perfect" matching results, the (x, y) coordinates of each box of interest are derived, and each box is processed in turn.

    The box is floodfilled with some gray color (there's only black and white in the image, due to the binarization in the beginning)

    Floodfilled

    and then masked using that gray color:

    Masked

  4. From that, the bounding rectangle is determined, and that portion is copy-pasted from the original to some clean image. Also, pytesseract is executed on the content.
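Steps 3 and 4 can be tried on a toy example without OpenCV. The following is a minimal sketch, not the answer's actual method: `flood_fill_bbox` is a hypothetical pure-Python stand-in for the `cv2.floodFill` and `cv2.boundingRect` pair, assuming 4-connectivity and a tiny hand-drawn binary image. It only illustrates why seeding one pixel inside the corner yields the inner bounding rectangle of the box, even when text pixels sit inside it.

```python
from collections import deque


def flood_fill_bbox(img, seed_x, seed_y):
    """Return (x, y, w, h) of the 4-connected region containing the seed.

    `img` is a list of rows; the region consists of all pixels that share
    the seed pixel's value and are reachable from it.
    """
    height, width = len(img), len(img[0])
    target = img[seed_y][seed_x]
    seen = {(seed_x, seed_y)}
    queue = deque([(seed_x, seed_y)])
    min_x = max_x = seed_x
    min_y = max_y = seed_y
    while queue:
        x, y = queue.popleft()
        # Grow the bounding rectangle to include the current pixel
        min_x, max_x = min(min_x, x), max(max_x, x)
        min_y, max_y = min(min_y, y), max(max_y, y)
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < width and 0 <= ny < height
                    and (nx, ny) not in seen and img[ny][nx] == target):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return min_x, min_y, max_x - min_x + 1, max_y - min_y + 1


# Toy inverse-binarized image: black background (0), white box outline (255)
W, H = 9, 7
img = [[0] * W for _ in range(H)]
for x in range(1, 8):
    img[1][x] = img[5][x] = 255   # top and bottom border
for y in range(1, 6):
    img[y][1] = img[y][7] = 255   # left and right border
img[3][4] = 255                   # a single white "text" pixel inside the box

# Seed one pixel diagonally inside the upper left corner at (1, 1)
corner_x, corner_y = 1, 1
bbox = flood_fill_bbox(img, corner_x + 1, corner_y + 1)
print(bbox)
```

The text pixel is not part of the filled region, but it lies within the region's bounding rectangle, which is why the crop taken from that rectangle still contains the text.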

Here's the full code:

import cv2
import numpy as np
import pytesseract

# Read image as grayscale
img = cv2.imread('M7X8C.png', cv2.IMREAD_GRAYSCALE)

# Inverse binarize image to get rid of possible artifacts, and to have
# white boxes and text on black background
thr = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)[1]

# Set up a template and mask mimicking the upper left corner of the
# boxes of interest
templ = np.full((20, 50), 255, dtype=np.uint8)
templ[1:, 1:] = 0
mask = np.full_like(templ, 255)
mask[5:, 5:] = 0

# Template matching
res = cv2.matchTemplate(thr, templ, cv2.TM_CCORR_NORMED, mask=mask)

# Extract upper left corners of the boxes of interest
boxes_tl = np.argwhere(res == 1)

# Initialize new clean image
clean = np.full_like(img, 255)

# For each upper left corner...
for y, x in boxes_tl:

    # Get coordinates of upper left corner as plain Python ints
    # (some OpenCV versions may reject NumPy integers in point arguments)
    x, y = int(x), int(y)
    print('x: {}, y: {}'.format(x, y))

    # Flood fill inner part of box, and mask that area
    box_mask = cv2.floodFill(thr.copy(), None, (x + 1, y + 1), 128)[1] == 128

    # Extract the bounding rectangle of that area
    x, y, w, h = cv2.boundingRect(box_mask.astype(np.uint8))

    # Copy box content to clean image
    clean[y:y+h, x:x+w] = img[y:y+h, x:x+w]

    # Run pytesseract on box content
    text = pytesseract.image_to_string(thr[y:y+h, x:x+w], config='--psm 6')
    print(text.replace('\f', ''))

# Output
cv2.imshow('clean', clean)
cv2.waitKey(0)

That's the clean image:

Clean
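The clean image depends on every box corner passing the exact comparison `res == 1`. Whether masked `TM_CCORR_NORMED` scores come out as exactly 1.0 may depend on the OpenCV version and build, so a tolerance-based selection could be more forgiving. The sketch below is an assumption-laden alternative, not part of the original code: `perfect_matches` and the tolerance `1e-4` are hypothetical, and the small `res` array is made up to stand in for a real matching result, including non-finite entries that should never count as matches.

```python
import numpy as np


def perfect_matches(res, atol=1e-4):
    """Return (y, x) positions whose score is within `atol` of 1.0.

    Non-finite scores (inf, nan) are never reported as matches.
    """
    res = np.asarray(res, dtype=np.float64)
    return np.argwhere(np.isclose(res, 1.0, rtol=0.0, atol=atol))


# Made-up matching result: one exact hit, one near hit, some junk values
res = np.array([[0.20, 0.99999],
                [1.00, np.inf],
                [np.nan, 0.95]], dtype=np.float32)

exact = np.argwhere(res == 1)      # the strict comparison
tolerant = perfect_matches(res)    # the tolerance-based comparison

print(exact.tolist())
print(tolerant.tolist())
```

If the tolerance is chosen too generously, near-corner structures could start to match as well, so it should stay small.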

And these are the first two pytesseract results:

x: 1, y: 0
PGGEOS KKCI 100600

x: 199, y: 39
ISOL
EMBD
CB
400
XXX

As you can see, the results are not perfect (S instead of 5), most likely due to the monospace font. Getting (or generating) Tesseract traineddata for that kind of font would likely help to overcome that issue.
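Once such traineddata exists, Tesseract has to be pointed at it. A minimal sketch of how the config string might be assembled follows, assuming the file lives in a custom directory passed through Tesseract's `--tessdata-dir` option. The helper `build_tesseract_config`, the directory `/opt/my tessdata`, and the language name `monofont` are all hypothetical placeholders.

```python
import shlex


def build_tesseract_config(psm=6, tessdata_dir=None):
    """Assemble a Tesseract command-line config string."""
    parts = []
    if tessdata_dir is not None:
        # Quote the directory so that spaces or backslashes survive
        # shell-style splitting of the config string
        parts += ["--tessdata-dir", shlex.quote(tessdata_dir)]
    parts += ["--psm", str(psm)]
    return " ".join(parts)


config = build_tesseract_config(psm=6, tessdata_dir="/opt/my tessdata")
print(config)

# Hypothetical usage with a custom "monofont.traineddata" in that directory:
# text = pytesseract.image_to_string(crop, lang="monofont", config=config)
```

The `lang` argument selects the traineddata file by name, so it would have to match whatever the custom file is called.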

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
NumPy:         1.19.5
OpenCV:        4.5.2
pytesseract:   5.0.0-alpha.20201127
----------------------------------------

Upvotes: 1
