OCR and Distinguishing Between 2 or 3 Fonts

Question

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:

Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.

Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.

I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."

I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg02157.html), if that's relevant.

Is there anything out there that I can use to do this simple formatting classification?

Edit:

Is there anything out there that will do this without costing me an arm and a leg?

OCR and Distinguishing Between 2 or 3 Fonts

Answers (1)

Related Questions