Slicing of a scanned image based on large white spaces

Question

I am planning to split the questions from this PDF document into separate pieces. The challenge is that the questions are not evenly spaced. For example, the first question occupies an entire page and so does the second, while the third and fourth together make up one page. Slicing it manually would take ages, so I thought I would convert the pages to images and work on those. Is it possible to take an image like this

and split it into individual components like this?

Rotem · Accepted Answer

We may solve it using (mostly) morphological operations:

Read the input image as grayscale.
Apply thresholding with inversion. Automatic thresholding using cv2.THRESH_OTSU works well.
Apply a morphological opening to remove small artifacts (using the kernel np.ones((1, 3))).
Dilate horizontally with a very long horizontal kernel, turning each line of text into a full-width horizontal bar.
Apply a vertical closing to merge the bars of each question into one large cluster (two clusters for the sample image). The height of the vertical kernel should be tuned to the typical gaps: taller than the spacing between text lines within a question, shorter than the white space between questions.
Find the connected components with statistics.
Iterate over the connected components and crop the corresponding rows of the image.

Complete code sample:

import cv2
import numpy as np

img = cv2.imread('scanned_image.png', cv2.IMREAD_GRAYSCALE)  # Read image as grayscale
if img is None:
    raise FileNotFoundError('Could not read scanned_image.png')

thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]  # Apply automatic thresholding with inversion (text becomes white).

thresh = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((1, 3), np.uint8))  # Apply morphological opening to remove small artifacts.

thresh = cv2.dilate(thresh, np.ones((1, img.shape[1]), np.uint8))  # Dilate horizontally - make horizontal lines out of the text.

thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, np.ones((50, 1), np.uint8))  # Apply closing vertically - merge the lines of each question into one large cluster.

nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, 4)  # Find connected components with statistics

parts_list = []

# Iterate over the connected components (label 0 is the background):
for i in range(1, nlabel):
    top = int(stats[i, cv2.CC_STAT_TOP])  # Topmost y coordinate of the connected component
    height = int(stats[i, cv2.CC_STAT_HEIGHT])  # Height of the connected component

    # Crop the relevant rows with a 5-row margin above and below.
    # Clamp the start at 0: a negative start index would wrap around and give a wrong (usually empty) crop.
    roi = img[max(top-5, 0):top+height+5, :]
    parts_list.append(roi.copy())  # Add the cropped area to a list

    cv2.imwrite(f'part{i}.png', roi)  # Save the image part for testing
    cv2.imshow(f'part{i}', roi)  # Show part for testing

# Show img and thresh for testing
cv2.imshow('img', img)
cv2.imshow('thresh', thresh)

cv2.waitKey()
cv2.destroyAllWindows()
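
The height of the vertical closing kernel (50 in the code above) depends on the scan. As a rough aid for choosing it, the sketch below lists the runs of blank rows in a binarized page. The helper name blank_row_gaps and the synthetic page are made up for illustration and are not part of the original answer; it assumes text pixels are nonzero, as in the inverted threshold image.

```python
import numpy as np

def blank_row_gaps(binary):
    """Return (start_row, length) for every run of rows without foreground pixels.

    `binary` is a 2-D array in which text pixels are nonzero (assumption:
    it comes from an inverted threshold). Top and bottom margins are included.
    """
    has_ink = (binary != 0).any(axis=1)  # True for rows containing any text pixel
    gaps = []
    start = None
    for y, ink in enumerate(has_ink):
        if not ink and start is None:
            start = y  # a blank run begins
        elif ink and start is not None:
            gaps.append((start, y - start))  # the blank run ended at row y - 1
            start = None
    if start is not None:
        gaps.append((start, len(has_ink) - start))  # blank run reaching the last row
    return gaps

# Synthetic page standing in for a thresholded scan (hypothetical data).
page = np.zeros((200, 50), np.uint8)
page[10:20, 5:45] = 255    # question 1, text line 1
page[25:35, 5:45] = 255    # question 1, text line 2
page[120:130, 5:45] = 255  # question 2

gaps = blank_row_gaps(page)
print(gaps)  # -> [(0, 10), (20, 5), (35, 85), (130, 70)]
```

On a real page, a kernel height somewhere between the line spacing inside a question (5 rows here) and the white space between questions (85 rows here) should be a reasonable starting point.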

Results:

Stage 1:

Stage 2:

Stage 3:

Stage 4:

Top area:

Bottom area:
