Zian Choy

Reputation: 2894

OCR and Distinguishing Between 2 or 3 Fonts

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:

Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.

Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.

I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."

I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/[email protected]/msg02157.html), if that's relevant.
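
To make the kind of classification I mean concrete, here is a minimal sketch. It assumes the word boxes are already available as (text, x, y, width, height) tuples; that tuple layout, the function name, and the sample numbers are made up for illustration and are not Tesseract's actual output format. It only separates words by box height, splitting at the largest gap between heights, and says nothing about bold, italic, or script styles.

```python
def split_by_height(boxes):
    """Label each word box 'large' or 'small' by splitting at the biggest height gap.

    boxes: list of (text, x, y, width, height) tuples (hypothetical format).
    Returns a list of (text, label) pairs in the same order as the input.
    """
    heights = sorted({h for *_, h in boxes})
    if len(heights) < 2:
        # Only one distinct height: nothing to tell apart.
        return [(text, "small") for text, *_ in boxes]

    # For each pair of neighbouring heights, record (gap size, midpoint).
    gaps = [(b - a, (a + b) / 2) for a, b in zip(heights, heights[1:])]
    # The midpoint of the widest gap becomes the large/small threshold.
    _, threshold = max(gaps)

    return [(text, "large" if h > threshold else "small")
            for text, *_, h in boxes]


# Made-up sample boxes loosely based on the soup example above.
boxes = [
    ("Best", 10, 10, 80, 40),
    ("Soup", 100, 10, 85, 42),
    ("tap", 10, 60, 30, 14),
    ("water,", 45, 60, 50, 15),
    ("Sandwich", 10, 100, 160, 41),
]
labels = dict(split_by_height(boxes))
```

With these sample values, "Best", "Soup", and "Sandwich" land in one group and "tap" and "water," in the other. Real word boxes vary in height with ascenders and descenders, so comparing line heights rather than word heights would probably be more reliable.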

Is there anything out there that I can use to do this simple formatting classification?

Edit:

Is there anything out there that will do this without costing me an arm and a leg?

Upvotes: 3

Views: 722

Answers (1)

Nikolay

Reputation: 2214

I’m not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, the ABBYY OCR SDK can not only identify bold and italic font styles, but also determine the proper font face to use in the output.

Based on what you describe, I guess you are trying to determine the document's style hierarchy, such as header levels. ABBYY FineReader Engine provides this functionality, so you don’t have to write your own routine that infers the purpose of text from its font size and style. Besides, it provides the best OCR quality and it’s free to try. Consider trying it out if you plan to build commercial software. I work at ABBYY and can provide more info about our OCR SDK if necessary.

Best regards.

Upvotes: 1
