How to remove dotted band from text image?

Question

One of the problems I am working on is running OCR on documents. A few of the paystub documents have lines highlighted with a band of dots to set off important elements such as Gross Pay and Net Pay.

These dots cause erroneous OCR results: the engine reads them as ':' characters. I have tried several image-processing tools, such as ImageMagick, to remove the dots, but in each case the quality of the entire text is degraded, which results in poor OCR.

One ImageMagick command that I have tried is:

    convert mm150.jpg -kuwahara 3 mm2.jpg

I have also tried connected components, erosion with kernels, etc., but each method fails in some way.

I would like to know whether there is a method I should follow, or whether I am missing some image-processing capability.

Mohammed Jamali · Accepted Answer

This issue can be resolved using the connectedComponentsWithStats function of OpenCV. I found a reference for this in the question "How do I remove the dots / noise without damaging the text?"

I changed it a bit to fit my needs. This is the code that gave me the desired output.

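    import sys

    import cv2
    import numpy as np

    # Load as grayscale, then binarize with inversion so that text and dots
    # become white (255) foreground on a black background.
    img = cv2.imread(sys.argv[1], cv2.IMREAD_GRAYSCALE)
    _, blackAndWhite = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)

    # Label the connected foreground regions and collect per-region statistics.
    nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, 4, cv2.CV_32S)
    sizes = stats[1:, cv2.CC_STAT_AREA]  # pixel areas, skipping label 0 (background)
    img2 = np.zeros(labels.shape, np.uint8)

    # Keep only regions of at least 8 pixels; the dots are smaller and get dropped.
    for i in range(nlabels - 1):
        if sizes[i] >= 8:
            img2[labels == i + 1] = 255

    # Invert back to black text on a white background and save.
    res = cv2.bitwise_not(img2)
    cv2.imwrite('res.jpg', res)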

The output file that I got is clean, with the dotted band removed, so it gives perfect OCR results.
