Reputation: 153
I am trying to create an OCR system in Python - the first part involves extracting all characters from an image. This works fine, and each character ends up in its own bounding box.
Code attached below:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from scipy.misc import imread,imresize
from skimage.segmentation import clear_border
from skimage.morphology import label
from skimage.measure import regionprops
image = imread('./ocr/testing/adobe.png',1)
bw = image < 120
cleared = bw.copy()
clear_border(cleared)
label_image = label(cleared,neighbors=8)
borders = np.logical_xor(bw, cleared)
label_image[borders] = -1
print label_image.max()
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
ax.imshow(bw, cmap='jet')
for region in regionprops(label_image):
    if region.area > 20:
        minr, minc, maxr, maxc = region.bbox
        rect = mpatches.Rectangle((minc, minr), maxc - minc, maxr - minr,
                                  fill=False, edgecolor='red', linewidth=2)
        ax.add_patch(rect)
plt.show()
However, since the letters i and j have 'dots' on top of them, the code treats the dots as separate bounding boxes. I am using regionprops from skimage.measure. Would it also be a good idea to resize and normalise each bounding box?
How would I modify this code to account for i and j? My understanding is that I need to merge the bounding boxes that are close by. I tried that with no luck... Thanks.
Upvotes: 2
Views: 419
Reputation: 577
Yes, you generally want to normalize the content of your bounding boxes to fit your character classifier's input dimensions (assuming you are working on character classifiers with explicit segmentation and not with sequence classifiers segmenting implicitly).
For merging vertically isolated CCs of the same letter, e.g. i and j, I'd try an anisotropic Gaussian filter (very small sigma in x-direction, larger in y-direction). The exact parameterization will depend on your input data, but it should be easy to find a suitable value experimentally such that all letters result in exactly one CC.
An alternative would be to analyze CCs which exhibit horizontal overlap with other CCs and merge those pairs where the overlap exceeds a certain relative threshold.
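A minimal sketch of that overlap-merging alternative (the function name and the 0.5 threshold are my own choices, to be tuned on your data): boxes are the (minr, minc, maxr, maxc) tuples from region.bbox, and two boxes are fused when their column ranges overlap by more than the given fraction of the narrower box.

```python
def merge_overlapping_boxes(boxes, min_overlap=0.5):
    """Merge bounding boxes (minr, minc, maxr, maxc) whose horizontal
    (column) overlap exceeds min_overlap of the narrower box."""
    boxes = sorted(boxes, key=lambda b: b[1])  # sort by leftmost column
    merged = []
    for box in boxes:
        if merged:
            minr, minc, maxr, maxc = merged[-1]
            bminr, bminc, bmaxr, bmaxc = box
            # overlap of the two column ranges (negative if disjoint)
            overlap = min(maxc, bmaxc) - max(minc, bminc)
            narrower = min(maxc - minc, bmaxc - bminc)
            if narrower > 0 and overlap / narrower > min_overlap:
                # fuse into the union of both boxes
                merged[-1] = (min(minr, bminr), min(minc, bminc),
                              max(maxr, bmaxr), max(maxc, bmaxc))
                continue
        merged.append(box)
    return merged
```

With this, the dot of an i (which sits directly above the body, so their column ranges coincide almost completely) gets fused with the body, while horizontally adjacent letters stay separate.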
Illustrating on the given example:
# Anisotropic Gaussian: large sigma in y-direction, zero in x-direction,
# so vertically separated parts (dot and body) blur into one blob
from scipy.ndimage import gaussian_filter
filtered = gaussian_filter(image.astype(float), sigma=(2, 0))
plt.imshow(filtered, cmap=plt.cm.gray)
# Now threshold again; the exact value depends on your input data
binarized = filtered < 120
plt.imshow(binarized, cmap=plt.cm.gray)
It's easy to see that each character is now represented by exactly one CC. Now we pretty much only have to take the bounding box of each labelled component and crop away the surrounding white area to end up with one crop per character. After normalizing their size we can feed them directly to the classifier. Consider, though, that we lose the ascender/descender line information as well as the width/height ratio, and those may be useful features for the classifier; so it can help to feed them to the classifier explicitly in addition to the normalized bounding box content.
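A sketch of that crop-and-normalize step, assuming the regionprops regions and the binarized image from the code above (the function name and the (32, 32) output size are my own choices; pick whatever your classifier's input dimensions are):

```python
import numpy as np
from skimage.transform import resize

def extract_normalized_chars(bw, regions, out_shape=(32, 32)):
    """Crop each region's bounding box from the binary image bw and
    rescale it to a fixed classifier input size."""
    chars = []
    for region in regions:
        minr, minc, maxr, maxc = region.bbox
        crop = bw[minr:maxr, minc:maxc].astype(float)
        # anti_aliasing smooths the crop before downscaling
        chars.append(resize(crop, out_shape, anti_aliasing=True))
    return chars
```

If you also want to keep the width/height ratio and baseline position as extra features, record maxc - minc, maxr - minr and minr per region before resizing, since the rescaled crops no longer carry that information.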
Upvotes: 1