iRadium

Reputation: 255

OCR word separation

I'm developing an OCR system and need some help with word segmentation.

Currently the OCR system detects blobs in a line (using a connected-components labeling algorithm). Each blob represents a separate letter and has a bounding box around it. The bounding boxes of some characters may overlap.

How can I join those letters into words? How do I decide on the distance that separates words from each other, so that:

1. words are not cut apart
2. words are not joined to other words

From what I've seen, the distance between letters and the distance between words can both vary a lot.

This step runs before letter classification, so separating by the actual meaning of the words is not possible.

Thank you!

Upvotes: 1

Views: 2408

Answers (2)

Leafdoc

Reputation: 46

I'd be inclined to try to read the characters first. This will allow you to use a (language-dependent) tool that examines word endings to help confirm you've reached the end of a word. That information lets you bias your 'white space' detection and so improve the quality of your word boundaries. It has the additional benefit of improving your accuracy - or rather, it helps you know with more confidence when you are wrong ;)

White space is hard to deal with, and the majority of APIs that I know of (including our own) return a single white-space character regardless of how much space there is. If you're trying to process information that is laid out in table form (a letter with an address block top left and another top right, for example), you usually get a single space between the two sets of data. Storing the position of each character will help with the post-processing, of course.
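As a rough illustration of why stored positions help, here is a minimal sketch that rebuilds a line with a space count proportional to the horizontal gap, instead of collapsing every gap to one space. The function name `render_line`, the `(text, x_left, width)` tuple layout, and the choice of the median character width as "one space" are all assumptions made for the example, not something a particular OCR API provides.

```python
from statistics import median

def render_line(chars):
    """Rebuild a text line from recognized characters and their positions.

    chars: list of (text, x_left, width) tuples, sorted by x_left,
           with positive widths (hypothetical layout for this sketch).
    """
    if not chars:
        return ""
    # Assumption: one space is about as wide as a typical character.
    unit = median(width for _, _, width in chars)
    out = [chars[0][0]]
    for (_, x0, w0), (text, x1, _) in zip(chars, chars[1:]):
        gap = x1 - (x0 + w0)               # may be negative if boxes overlap
        spaces = max(0, int(gap / unit + 0.5))
        out.append(" " * spaces)
        out.append(text)
    return "".join(out)

line = [("A", 0, 10), ("B", 12, 10), ("C", 52, 10)]
print(repr(render_line(line)))
```

With proportional fonts the median width is only a crude unit, so the space counts are approximate; they are still enough to tell a two-column layout from an ordinary word gap.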

Good luck!

Upvotes: 0

argentage

Reputation: 2778

If you take a histogram of the ink pixels in each vertical column of the line, you will probably find that the columns separating words tend to have among the lowest counts. If you insist on finding the word separators before reading the letters themselves, techniques like this one, combined with some sort of binary classifier, are probably a good starting point. (For example, you could weight this histogram against the average word length in your corpus.)
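A minimal sketch of the column-histogram idea, assuming the text line is already binarized into rows of 0 (background) and 1 (ink). The function names are made up for the example, and the "classifier" here is only a stand-in: it splits the blank-run widths at the largest jump between consecutive sorted widths, which is a simple heuristic rather than the method from the linked paper.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary line image."""
    return [sum(column) for column in zip(*image)]

def blank_runs(profile):
    """Return (start, width) for each run of empty columns between ink.

    Leading and trailing margins are ignored.
    """
    runs, start = [], None
    for x, ink in enumerate(profile):
        if ink == 0 and start is None:
            start = x
        elif ink != 0 and start is not None:
            runs.append((start, x - start))
            start = None
    return [run for run in runs if run[0] != 0]

def split_threshold(widths):
    """Stand-in classifier: cut at the largest jump in sorted gap widths."""
    distinct = sorted(set(widths))
    if len(distinct) < 2:
        return None  # all gaps look alike; no basis for a split
    jumps = [(b - a, (a + b) / 2) for a, b in zip(distinct, distinct[1:])]
    return max(jumps)[1]

def word_gaps(image):
    """Return the blank runs classified as word separators."""
    runs = blank_runs(column_profile(image))
    cut = split_threshold([width for _, width in runs])
    if cut is None:
        return []
    return [(start, width) for start, width in runs if width > cut]

row = [int(c) for c in "1101100001101"]
print(word_gaps([row, row]))
```

This heuristic always finds a split when gap widths differ, so a line holding a single word would still be cut at its widest letter gap. That is the kind of case where extra evidence, such as expected word lengths, would need to be weighed in.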

See: http://www.ijcaonline.org/rtippr/number1/SPE96T.pdf

Upvotes: 1
