How should I remove noise from this thresholded image in OpenCV?

Question

I would like to remove anything that is not part of the letters and numbers in the image. The input image is shown below:

I have tried Canny edge detection, but it is susceptible to noise, and the noise contours are quite big. For the same reason, morphological operations have also been unsuccessful. I tried cv2.MORPH_CLOSE, but the noise areas got bigger.

Here is my code; so far it does nothing useful to remove the noise:

import cv2
import imutils

input=cv2.imread("n4.jpg")
resized = imutils.resize(input, width=700)
cv2.imshow("resized",resized)

blur = cv2.GaussianBlur(resized,(7,7),0)
cv2.imshow("blur",blur)

gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
threshINV  = cv2.threshold(gray, 140, 255, cv2.THRESH_BINARY_INV)[1]
cv2.imshow("thresh",threshINV)

e = cv2.Canny(threshINV,20,50)
cv2.imshow("e",e)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (4,4))
close = cv2.morphologyEx(threshINV, cv2.MORPH_CLOSE, kernel)
cv2.imshow("close",close)


edged = cv2.Canny(gray, 20, 50)
dilat = cv2.dilate(edged, None, iterations=1)
cv2.imshow("test",dilat)
cv2.waitKey(0)
cv2.destroyAllWindows()

I've looked at this example and this other example, however they would not work because of the size of the noise and the fact that the contours I would like to keep do not have a definable shape.

I've also looked at this method, but again I don't think it will work since there is no overall contour to smooth out.

Rotem · Accepted Answer

The image you have posted is very challenging.
The solution I am posting is tailored to that specific image.
I tried to keep it as general as I could, but I don't expect it to work very well on other images.
You may use it as a source of ideas for other ways to remove noise.

The solution is mainly based on finding connected components and removing the smaller ones, which are treated as noise.

I used pytesseract to check whether the result is clean enough for OCR.

Here is the code (please read the comments):

import numpy as np
import scipy.signal
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # For Windows OS

# Read input image
input = cv2.imread("n4.jpg")

# Convert to Grayscale.
gray = cv2.cvtColor(input, cv2.COLOR_BGR2GRAY)

# Convert to binary and invert polarity
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find connected components (clusters)
nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, connectivity=8)


# Remove small clusters: With both width<=10 and height<=10 (clean small size noise).
for i in range(nlabel):
    if (stats[i, cv2.CC_STAT_WIDTH] <= 10) and (stats[i, cv2.CC_STAT_HEIGHT] <= 10):
        thresh[labels == i] = 0

# Use closing with a very large horizontal kernel
mask = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, np.ones((1, 150)))

# Find connected components (clusters) on mask
nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)

# Find label with maximum area
# https://stackoverflow.com/questions/47520487/how-to-use-python-opencv-to-find-largest-connected-component-in-a-single-channel
largest_label = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])

# Set to zero all clusters that are not the largest cluster.
thresh[labels != largest_label] = 0

# Use closing with horizontal kernel of 15 (connecting components of digits)
mask = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, np.ones((1, 15)))

# Find connected components (clusters) on mask again
nlabel, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)

# Remove small clusters: With both width<=30 and height<=30
for i in range(nlabel):
    if (stats[i, cv2.CC_STAT_WIDTH] <= 30) and (stats[i, cv2.CC_STAT_HEIGHT] <= 30):
        thresh[labels == i] = 0

# Use closing with horizontal kernel of 15, this time on thresh
thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, np.ones((1, 15)))

# Use median filter with 3x5 mask (OpenCV medianBlur with k=5 removes important details).
thresh = scipy.signal.medfilt(thresh, (3,5))

# Invert polarity
thresh = 255 - thresh

# Apply OCR
data = pytesseract.image_to_string(thresh, config="-c tessedit"
                                                  "_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-/"
                                                  " --psm 6"
                                                  " ")

print(data)

# Show image for testing
cv2.imshow('thresh', thresh)
cv2.waitKey(0)
cv2.destroyAllWindows()

thresh (clean image):

OCR result: EXPO22016/01-2019
