amor.fati95

Reputation: 119

Segmentation of lines, words and characters from a document's image

I am working on a project where I have to read a document from an image. Initially I will read machine-printed documents, and eventually move on to images of handwritten documents. I am doing this for learning purposes, so I don't intend to use existing OCR engines such as Tesseract. I intend to proceed in steps:

  1. Preprocessing (blurring, thresholding, erosion and dilation)

  2. Character Segmentation

  3. OCR (or ICR in later stages)

I am working on the character segmentation right now, and I recently implemented it using horizontal and vertical histograms. The results were not very good for some fonts, for example on the image shown below.

Document's Image

Is there any other method or algorithm to do the same? Any help will be appreciated!

Edit 1:

The result I got after detecting blobs using cv2.SimpleBlobDetector. Results

The result I got after using cv2.findContours. [image]

Upvotes: 4

Views: 7041

Answers (1)

user1196549

Reputation:

A first option is deskewing, which starts with measuring the skew angle. You can do this, for instance, by Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact (with dark text on a light background, erosion is what grows the dark regions). Then binarize and either thin the blobs or find their lower edges (or directly measure the directions of the blobs). You will get slightly oblique line segments which give you the skew direction.
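As a rough illustration of the "lower edges" idea, here is a minimal pure-Python sketch. It assumes the image has already been binarized and smeared horizontally, that it contains a single text-line blob stored as a list of rows with 1 for ink, and that a least-squares line fit is an acceptable way to get the direction. The function names lower_edge_points and skew_angle_degrees are made up for this example and are not part of OpenCV.

```python
import math

def lower_edge_points(binary):
    """For each column that contains ink, return (x, y) of its lowest ink pixel.

    Assumes `binary` is a list of equal-length rows (1 = ink, 0 = background)
    holding one smeared text line.
    """
    points = []
    for x in range(len(binary[0])):
        ys = [y for y, row in enumerate(binary) if row[x]]
        if ys:
            points.append((x, max(ys)))  # image y grows downward
    return points

def skew_angle_degrees(points):
    """Angle of the least-squares line through the points, in degrees.

    With image coordinates (y downward), a positive angle means the
    text line descends toward the right.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all points share the same x")
    return math.degrees(math.atan2(sxy, sxx))
```

On a real page you would run this per blob (or per text line) and combine the angles, for example by taking their median, before counter-rotating.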


When you know the skew direction, you can counter-rotate the image to deskew it. The vertical histogram will then reliably separate the lines, and you can use a horizontal histogram within each of them.
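The two-pass histogram split can be sketched as follows, assuming the image is already deskewed and binarized (a list of rows, 1 = ink). Naming conventions for "vertical" and "horizontal" histograms vary, so the code speaks of row sums (which separate the lines) and column sums (which separate characters inside a line). The function names and the threshold parameter are illustrative choices, not from any library.

```python
def runs_above(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended at i - 1
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines_and_columns(binary):
    """Split a deskewed binary page into text lines, then into column runs.

    Returns a list of ((top, bottom), [(left, right), ...]) entries,
    one per detected line, with exclusive end indices.
    """
    result = []
    row_sums = [sum(row) for row in binary]
    for top, bottom in runs_above(row_sums):
        band = binary[top:bottom]
        col_sums = [sum(col) for col in zip(*band)]
        result.append(((top, bottom), runs_above(col_sums)))
    return result
```

Fonts whose characters touch or overlap horizontally (italics, kerned pairs) will still merge in the column sums, which is one reason this approach struggles on some fonts.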

A second option, IMO much better, is to binarize the characters and perform blob detection. Then proximity analysis of the bounding boxes will allow you to determine chains of characters. They will tell you the lines, and where spacing is larger, delimit the words.
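A minimal sketch of such a proximity analysis, assuming you already have one bounding box per character as (x, y, w, h) tuples (for instance from a connected-component or contour step). The chaining rule (vertical overlap with the last box of a line) and the word-gap rule (gap larger than a fraction of the median box height) are assumptions made for this example, and the function names are hypothetical.

```python
from statistics import median

def vertical_overlap(a, b):
    """Overlap in pixels of the vertical extents of two (x, y, w, h) boxes."""
    return min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])

def group_into_lines(boxes):
    """Chain boxes left to right into text lines.

    A box joins the first line whose most recent box it overlaps
    vertically, so a chain can follow a slightly skewed line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for line in lines:
            if vertical_overlap(line[-1], box) > 0:
                line.append(box)
                break
        else:
            lines.append([box])            # no match: start a new line
    return sorted(lines, key=lambda line: line[0][1])

def split_into_words(line, gap_factor=0.5):
    """Split a line where the horizontal gap exceeds gap_factor * median height."""
    limit = gap_factor * median(b[3] for b in line)
    words = [[line[0]]]
    for prev, box in zip(line, line[1:]):
        gap = box[0] - (prev[0] + prev[2])
        if gap > limit:
            words.append([box])            # wide gap: new word
        else:
            words[-1].append(box)
    return words
```

Small marks such as dots on "i" or punctuation may not overlap their neighbours vertically and can end up in separate chains, so a real pipeline would likely need to merge tiny boxes into the nearest line first.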


Upvotes: 7
