Reputation: 533
I am trying to extract characters from all fields in forms with boxes such as the one shown here:
Sample printed form
My current approach is as follows:
The boxes may be slightly tilted in the images. I use an alignment algorithm but it still does not always straighten the box edges. This can be seen in this image:
Aligned date crop
.
On such images, when I crop characters using straight lines (step 3 of the algorithm mentioned above), the edges of boxes are also included which confuse the character recognition module. For example, the number '3' and 'the box edge' is represented as 31 sometimes.
I want to use pre-trained models only and hence, I am looking for a better way to extract characters from the boxed fields properly.
I would highly appreciate any help provided by the SO community.
Upvotes: 1
Views: 1041
Reputation: 439
As the box edges are generally thinner (as in your case) than the text inside them, we can leverage this information. By applying a horizontal morphological closing kernel (Dilation -> Erosion), we can make thin vertical lines white, which will help OCR. Some trash may be left after processing but that wouldn't hinder OCR accuracy. The size of the kernel depends on the width of the border lines. Obviously, you can tweak it as per your case.
Here is the sample code:
import cv2
import numpy as np
im = cv2.imread('sample_image.png')
im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
k1 = (4,1)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, k1)
im = cv2.morphologyEx(im, cv2.MORPH_CLOSE, kernel, iterations=1)
_,im = cv2.threshold(im, thresh=200, maxval=255, type=cv2.THRESH_BINARY)
cv2.imwrite('sample_output.png',im)
And here are the images:
Upvotes: 1