sebbz

Reputation: 554

Removing internal borders

I have many images cropped from a picture of a table. OCR has trouble detecting the text because of leftover fragments of the table borders. I'm looking for a way to remove them so that only the text remains. Here are some examples:

first image example

second image example

Thanks!

Upvotes: 0

Views: 864

Answers (1)

Oliver Wilken

Reputation: 2714

This code (based on OpenCV) solves the problem for the two examples. The procedure is as follows:

  • threshold image
  • remove lines from binary objects
    • compute ratio = (area of object)/(area of bounding box)
      • if the ratio is too small, we consider the object to be a combination of lines
      • if the ratio is too big, we consider the object to be a single line

Here is the Python code:

import cv2
import matplotlib.pyplot as plt
import numpy as np

# load image as grayscale
img = cv2.imread('om9gN.jpg', cv2.IMREAD_GRAYSCALE)

# blur and apply otsu threshold
img = cv2.blur(img, (3,3))
_, img = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)

# invert and convert to a 0/1 mask: dark pixels (text and borders) become 1
img = (img == 0).astype(np.uint8)


img_new = np.zeros_like(img)

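# find contours
# OpenCV 3 returns (image, contours, hierarchy), OpenCV 4 returns (contours, hierarchy);
# indexing with [-2] picks the contours in both versions
contours = cv2.findContours(img, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]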

for idx, cnt in enumerate(contours):

    # get area of contour
    temp = np.zeros_like(img)
    cv2.drawContours(temp, contours , idx, 1, -1)
    area_cnt = np.sum(temp)

    # get number of pixels of bounding box of contour
    x,y,w,h = cv2.boundingRect(cnt)
    area_box = w * h

    # get ratio of cnt-area and box-area
    ratio = float(area_cnt) / area_box

    # only draw contour if:
    #    - 1.) ratio is not too big (line fills whole bounding box)
    #    - 2.) ratio is not too small (combination of lines fills only a
    #                                  small fraction of the bounding box)
    if 0.9 > ratio > 0.2:
        cv2.drawContours(img_new, contours , idx, 1, -1)

plt.figure()
plt.subplot(1,2,1)
plt.imshow(img_new)
plt.axis("off")
plt.show()

Isolated letters
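
To see why the ratio test separates border leftovers from letters, here is a small toy sketch in plain Python, without OpenCV. The shapes and the helper names `fill_ratio` and `looks_like_text` are made up for illustration; only the 0.2 and 0.9 thresholds come from the code above.

```python
def fill_ratio(pixels):
    """Return (number of object pixels) / (area of the object's bounding box)."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return len(pixels) / (width * height)


def looks_like_text(pixels, low=0.2, high=0.9):
    """Keep an object only if its ratio lies inside the band low < ratio < high."""
    return low < fill_ratio(pixels) < high


# Hypothetical toy shapes, given as sets of (row, col) pixel coordinates.

# A straight border line, 2 px thick and 40 px long: it fills its whole box.
line = {(r, c) for r in range(2) for c in range(40)}

# A border corner: one horizontal and one vertical stroke, each 2 px thick,
# spanning a 40 x 40 box, so most of the box stays empty.
corner = ({(r, c) for r in range(2) for c in range(40)}
          | {(r, c) for r in range(40) for c in range(2)})

# A blocky letter "T": a 12 x 4 bar on top of a 4 x 12 stem.
letter_t = ({(r, c) for r in range(4) for c in range(12)}
            | {(r, c) for r in range(4, 16) for c in range(4, 8)})

for name, shape in [("line", line), ("corner", corner), ("letter T", letter_t)]:
    print(name, round(fill_ratio(shape), 3), looks_like_text(shape))
```

The line has a ratio of 1.0 and the corner a ratio of roughly 0.1, so both fall outside the band, while the letter sits at 0.5 and is kept. One consequence of this design: a glyph that is itself almost a solid rectangle, such as a dash or a sans-serif "I", also fills its bounding box and would be dropped, so the thresholds may need tuning for other fonts.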

Upvotes: 1
