Christoph Bätz

Reputation: 71

Tesseract OCR - recognize checkboxes as word

For a customer, I want to teach Tesseract to recognize checkboxes as a word. This worked fine when Tesseract only had to recognize an empty checkbox.

This command, in combination with this tutorial, worked like a charm: Tesseract was able to find empty checkboxes and interpret them as "[_]":

tesseract -psm 10 deu2.unchecked1.exp0.JPG deu2.unchecked1.exp0.box nobatch box.train

Here is the command I use to successfully analyze a document:

tesseract test.png test -l deu1+deu2

Then I tried to train a checked checkbox, but got this error:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 1/[X] ((60,30),(314,293)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:       1
   Boxes failed resegmentation:       1
   Found 0 good blobs.
Generated training data for 0 words

Does anyone have an idea how to teach Tesseract to recognize checked checkboxes as well?

Thank you in advance!

Upvotes: 5

Views: 3962

Answers (1)

Christoph Bätz

Reputation: 71

After many more attempts, I found that it is of course possible to teach Tesseract different kinds of letters. But as far as I know today, there is no way to teach Tesseract a symbol that does not conform to certain "visual rules" of a letter. For example, a letter is always one connected stroke of ink, or at most a combination of ink and "something outside it" (for example: i, ä, ö, ü). The problem here is that no letter resembles a checked checkbox (one object inside another object). This confuses Tesseract and leads to crashes.
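To make the "one object inside another object" idea concrete, here is a minimal Python sketch. It is not how Tesseract segments an image; it is only an illustration under my own assumptions: glyphs are tiny hand-drawn grids where `#` is ink, blobs are found with a simple 8-connected flood fill, and the helper names (`components`, `has_nested_blob`) are hypothetical. It shows that an "i" has two blobs lying next to each other, while a checked checkbox has one blob strictly inside the bounding box of another.

```python
from collections import deque

def components(grid):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink blobs."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] != "#" or seen[y][x]:
                continue
            # Flood fill one blob, tracking its bounding box as we go.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def has_nested_blob(grid):
    """True if one blob's bounding box lies strictly inside another's."""
    boxes = components(grid)
    return any(
        a[0] < b[0] and a[1] < b[1] and a[2] > b[2] and a[3] > b[3]
        for a in boxes for b in boxes
    )

# Illustrative glyphs (made up for this sketch, not taken from the training images).
LETTER_I = [
    ".#.",
    "...",
    ".#.",
    ".#.",
    ".#.",
]
EMPTY_BOX = [
    "#######",
    "#.....#",
    "#.....#",
    "#.....#",
    "#.....#",
    "#.....#",
    "#######",
]
CHECKED_BOX = [
    "#######",
    "#.....#",
    "#.#.#.#",
    "#..#..#",
    "#.#.#.#",
    "#.....#",
    "#######",
]

for name, glyph in [("i", LETTER_I), ("empty box", EMPTY_BOX), ("checked box", CHECKED_BOX)]:
    print(name, len(components(glyph)), has_nested_blob(glyph))
```

In this toy model, only the checked checkbox has a nested blob. That would be consistent with the empty checkbox training fine while the checked one failed, although the sketch says nothing about what Tesseract actually does internally.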

Upvotes: 2
