Reputation: 175
One of the problems that I am working on is to do OCR on documents. A few of the paystub document have a highlighted line with dots to differentiate important elements like Gross Pay, Net Pay, etc.
These dots give erroneous results in OCR, it considers them as ':' character and doesn't give desired results. I have tried a lot of things for image processing such as ImageMagick, etc to remove these dots. But in each case the quality of entire text data is degraded resulting in poor OCR.
ImageMagick commands that I have tried is:
convert mm150.jpg -kuwahara 3 mm2.jpg
I have also tried connected components, erosion with kernels, etc, but each method fails in some way.
I would like to know if there is some method that I should follow, or am I missing something from Image Processing capabilities.
Upvotes: 3
Views: 2157
Reputation: 175
This issue can be resolved using connectedComponentsWithStats function of opencv. I found reference for this from this question How do I remove the dots / noise without damaging the text?
I changed it a bit to fit as per my needs. And this is the code that helped me get desired output.
import cv2
import numpy as np
import sys
img = cv2.imread(sys.argv[1], 0)
_, blackAndWhite = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, 4, cv2.CV_32S)
sizes = stats[1:, -1] #get CC_STAT_AREA component
img2 = np.zeros((labels.shape), np.uint8)
for i in range(0, nlabels - 1):
if sizes[i] >= 8: #filter small dotted regions
img2[labels == i + 1] = 255
res = cv2.bitwise_not(img2)
cv2.imwrite('res.jpg', res)
The output file that I got is pretty clear with dotted band removed such as it gives perfect OCR results.
Upvotes: 6