Reputation: 323
I am trying to extract characters from a form for OCR. After experimenting with connected components, MSER and contours, I found contours to be the most reliable. The problem is that it sometimes fails to detect shapes that are very similar to ones it has already detected. For instance, in the attached image, the "A" in row 1, column 4 is undetected, while the one just two columns away is detected. The same happens with the "A" in row 3 (column 3 vs. column 7).
Here's the code I am using to get the above:
import cv2

im = cv2.imread('IMAGES/ACH0.png')
imgray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
imgray = cv2.GaussianBlur(imgray, (5, 5), 0)
# With THRESH_OTSU the threshold is chosen automatically; the 127 is ignored
ret, thresh = cv2.threshold(imgray, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# [-2:] works whether findContours returns 2 values (OpenCV 4.x) or 3 (OpenCV 3.x)
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for cnt in contours:
    # Skip very large contours
    if cv2.contourArea(cnt) > 10000:
        continue
    # Draw a green bounding box around each remaining contour
    x, y, w, h = cv2.boundingRect(cnt)
    cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 1)
I tried reading up on the inner workings of the cv2 implementation of findContours but couldn't find any resources on it (if I could, I could at least debug and understand why this happens). Any pointers would be greatly appreciated.
Upvotes: 0
Views: 157
Characters that touch the grid cannot be isolated because they are connected to it and therefore belong to one larger blob, which findContours traces as a single shape.
As the grid seems to be well aligned, you can try to locate the grid lines and erase them before performing OCR.
Upvotes: 2