Jaro Kollár

Reputation: 263

Tesseract - training

I am trying to train Tesseract on a font.

I am using jTessBoxEditor and Serak.

First I create a .txt file containing, for example, 10 000 characters separated by single spaces. I use this as input for the TIFF/BOX generator in jTessBoxEditor, which produces the box file and a .tiff image.

Next I check the boxes and they look correct. I then load them into Serak, train Tesseract, and get an xxx.traineddata file.

Now I want to verify the results. I create a small .txt file with, for example, 100 characters separated by spaces, all of which look very similar (the file contains something like 5 S 5 S 0 O 2 Z and so on). I generate a new .tiff from it the same way as for training: jTessBoxEditor with the same font. Then I run OCR on this new .tiff in Serak, and the result is that 0 is confused with O, 5 with S, and so on.
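To put numbers on the mix-ups, the test text can be compared with the OCR output token by token. This is only a minimal sketch: the function name `confusion_counts` and the sample recognized string are made up for illustration, and it assumes the OCR output keeps one space-separated token per character.

```python
from collections import Counter


def confusion_counts(expected: str, recognized: str) -> Counter:
    """Count (expected, recognized) token pairs that differ."""
    exp_tokens = expected.split()
    rec_tokens = recognized.split()
    if len(exp_tokens) != len(rec_tokens):
        # Alignment is lost if OCR merged or split characters.
        raise ValueError("token counts differ; cannot align the two texts")
    return Counter(
        (e, r) for e, r in zip(exp_tokens, rec_tokens) if e != r
    )


# Hypothetical example: ground truth vs. an invented OCR result.
counts = confusion_counts("5 S 5 S 0 O 2 Z", "S S 5 S O O 2 2")
for (exp_char, rec_char), n in counts.most_common():
    print(f"expected {exp_char!r}, got {rec_char!r}: {n}")
```

If the same pairs show up again and again, that suggests the problem is in the character classes themselves and not in a few bad glyphs.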

What am I doing wrong?

Upvotes: 0

Views: 730

Answers (1)

Yep

Reputation: 141

Are you sure that the new font you created actually made it into the .traineddata file? You must add the font to the font_properties file, run unicharset_extractor on the box files, then run mftraining and cntraining, and finally combine everything to get the resulting .traineddata file. I have had a similar problem, and my guess is that the error is most likely in the creation of the .traineddata file. Once your new font is in, Tesseract should have no problem recognizing the characters in files generated from the same font you trained it with.
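For reference, here is a rough sketch of that sequence as I understand the Tesseract 3.x command-line tools. The language name `xxx` and the font name `myfont` are placeholders, the `lang.font.expN` file naming is the usual training convention, and the helper functions are made up for this post. The script only prints the commands, it does not run them, so check the flags against the documentation for your Tesseract version before using them.

```python
from typing import Dict, List


def font_properties_line(font: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """One font_properties entry: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(f)) for f in flags])


def training_commands(lang: str, font: str, exp: int = 0) -> List[List[str]]:
    """Commands to run in order, before renaming and combining."""
    base = f"{lang}.{font}.exp{exp}"
    return [
        # Produce the .tr feature file from the image and its box file.
        ["tesseract", f"{base}.tif", base, "box.train"],
        # Build the unicharset from the box file.
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def renames(lang: str) -> Dict[str, str]:
    """Output files that need the language prefix before combining."""
    # shapetable is only written by newer 3.x versions (assumption).
    names = ("inttemp", "normproto", "pffmtable", "shapetable")
    return {name: f"{lang}.{name}" for name in names}


if __name__ == "__main__":
    print("font_properties:", font_properties_line("myfont"))
    for cmd in training_commands("xxx", "myfont"):
        print(" ".join(cmd))
    for old, new in renames("xxx").items():
        print(f"rename {old} -> {new}")
    # The trailing dot is part of the argument.
    print("combine_tessdata xxx.")
```

The font name in font_properties has to match the font part of the box/tif file names. If it does not, the font will not end up in the .traineddata file.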

Upvotes: 0
