Reputation: 35
I am preparing an image for Tesseract to OCR. What I have done so far is convert my image to the following:
What I basically want is to cut the image into horizontal portions based on the white regions. Like so:
What I care most about are the text areas on the left side and in the middle.
The problem is that if I pick only the left region, I can't find a way to also pick the ones in the middle without deleting some parts.
The other problem I faced is that if I give Tesseract all the regions (I have already successfully extracted every region that contains text), it gives me rubbish, since the picture contains both a Latin and a non-Latin script.
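(For what it's worth, a minimal sketch of how I could hand both scripts to Tesseract at once, assuming pytesseract is used and that the non-Latin script is Arabic with the ara traineddata installed:)

import cv2
import pytesseract

# sketch only: assumes the non-Latin text is Arabic and that both the
# 'eng' and 'ara' traineddata files are available to Tesseract
region = cv2.imread('region.png')   # one of the cropped text regions
text = pytesseract.image_to_string(region, lang='eng+ara')
print(text)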
Another important thing is that there is no predefined size, so it would be wrong to assume the size in this picture is standard.
To recapitulate: how can I cut the image horizontally based on the white regions?
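(A minimal sketch of the kind of cut I mean, based on counting white pixels per row; the 0.995 white-row ratio and the file names are just assumptions:)

import cv2
import numpy as np

# sketch: split the page into horizontal strips wherever a run of rows
# is (almost) entirely white
img = cv2.imread('lic.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

white_ratio = (bw == 255).mean(axis=1)   # fraction of white pixels per row
is_text_row = white_ratio < 0.995        # rows that contain some ink

strips, start = [], None
for y, has_text in enumerate(is_text_row):
    if has_text and start is None:
        start = y
    elif not has_text and start is not None:
        strips.append(img[start:y])
        start = None
if start is not None:
    strips.append(img[start:])

for i, strip in enumerate(strips):
    cv2.imwrite('strip_{}.png'.format(i), strip)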
Upvotes: 2
Views: 153
Reputation: 21233
I looked up the documentation to see if there was anything I could use, and yes, I came across an interesting property called the extent of a contour, from THIS PAGE.
The extent of a contour is defined as the ratio of the area of the contour to the area of the bounding rectangle of that contour. So the closer this value is to 1, the more the contour resembles a rectangle.
For the image you have given, it does not detect the words that look like Arabic. But it would work if some morphological operations were done prior to this.
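I have not shown that morphological step here, but a minimal sketch, assuming a closing with a wide horizontal kernel (the 13x3 kernel size is only a guess you would tune, and imgray is the grayscale image from the code below), could look like this:

# sketch: merge the disconnected Arabic-looking glyphs into word-sized blobs
# before finding contours; the kernel size is only an assumption
th_inv = cv2.threshold(imgray, 0, 255,
                       cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (13, 3))
closed = cv2.morphologyEx(th_inv, cv2.MORPH_CLOSE, kernel)

You could then run the same contour and extent loop on closed instead of th2.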
Code:
import cv2

path = 'C:/Users/Desktop/Stack/contour/'
im = cv2.imread(path + 'lic.png')

#--- resized because the image was too big ---
im = cv2.resize(im, (0, 0), fx = 0.5, fy = 0.5)

#--- grayscale followed by Otsu thresholding ---
imgray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
ret2, th2 = cv2.threshold(imgray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

im2 = im.copy()

#--- OpenCV 3.x returns three values here; in 4.x drop the first one ---
_, contours, hierarchy = cv2.findContours(th2, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

count = 0
#--- It all begins here ---
for cnt in contours:
    area = cv2.contourArea(cnt)
    x, y, w, h = cv2.boundingRect(cnt)
    rect_area = w * h
    extent = float(area) / rect_area
    #--- there were some very small rectangular regions, hence the extra area criterion ---
    if (extent > 0.5) and (area > 100):
        count += 1
        cv2.drawContours(im2, [cnt], 0, (0, 255, 0), 2)

cv2.imshow(path + 'contoursdate.jpg', im2)
cv2.waitKey(0)

print('Number of possible words : {}'.format(count))
Result:
In this case I have just drawn the contours. You, on the other hand, can crop these regions by fitting a bounding rectangle around each of them and feed them individually to an OCR engine.
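A rough sketch of that last step, continuing from the code above and assuming pytesseract with, say, 'eng+ara' as the language mix:

import pytesseract

# sketch: crop each accepted contour via its bounding rectangle and hand
# the crop to Tesseract; the language string is only an assumption
for cnt in contours:
    area = cv2.contourArea(cnt)
    x, y, w, h = cv2.boundingRect(cnt)
    extent = float(area) / (w * h)
    if extent > 0.5 and area > 100:
        roi = im[y:y + h, x:x + w]
        text = pytesseract.image_to_string(roi, lang='eng+ara')
        print(text)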
Upvotes: 2
Reputation: 3491
You can play with the parameters to increase or decrease the number of lines detected. I followed this guide.
Loading and inverting the image:
import cv2
import numpy as np

img = cv2.imread('lic.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = 255 - gray   # invert so the dark text becomes bright
Getting the edges:
edges = cv2.Canny(gray,50,150,apertureSize = 5)
minLineLength = 10
maxLineGap = 30
Finding the lines with a Probabilistic Hough Transform:
# pass minLineLength and maxLineGap as keyword arguments so they are not
# mistaken for the optional 'lines' output parameter
lines = cv2.HoughLinesP(edges, 0.7, np.pi / 180, 100,
                        minLineLength=minLineLength, maxLineGap=maxLineGap)
for line in lines:
    for x1, y1, x2, y2 in line:
        if x2 - x1 == 0:   # skip vertical lines to avoid division by zero
            continue
Checking that the slope is between -45 degrees and 45 degrees (you can adjust as needed):
        dy = y2 - y1
        dx = x2 - x1
        if -1 < dy / dx < 1:
            # extend the segment far beyond its endpoints so it spans the image
            cv2.line(img, (x1 + dx * -100, y1 + dy * -100),
                     (x2 + dx * 100, y2 + dy * 100), (0, 255, 0), 2)

cv2.imshow("image: " + str(len(lines)), img)
cv2.waitKey(0)
cv2.destroyAllWindows()
cv2.imwrite('houghlines3.jpg', img)
Which produced this image:
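If you want the actual horizontal cuts rather than just the drawn lines, a rough sketch would be to collect the y-coordinates of the near-horizontal lines and slice the image between them (the 10 px merge tolerance is just a guess, and ideally you would slice a clean copy of the image taken before the green lines were drawn):

# sketch: use the y-coordinates of the near-horizontal lines found above
# as cut positions
ys = []
for line in lines:
    for x1, y1, x2, y2 in line:
        if x2 - x1 != 0 and -1 < (y2 - y1) / (x2 - x1) < 1:
            ys.append((y1 + y2) // 2)

cuts = []
for y in sorted(ys):
    if not cuts or y - cuts[-1] > 10:   # merge detections that are close together
        cuts.append(y)

bounds = [0] + cuts + [img.shape[0]]
for i in range(len(bounds) - 1):
    strip = img[bounds[i]:bounds[i + 1]]
    if strip.size:
        cv2.imwrite('strip_{}.png'.format(i), strip)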
Upvotes: 2