Juan Arevalo

Reputation: 41

Training Tesseract for a new font

When creating the clustering data with

mftraining -F font_properties -U unicharset -O lang.unicharset *.tr

I get the following output:

C:\Users\ \AppData\Local\Tesseract-OCR>mftraining -F font_properties -U unicharset -O eng1.unicharset eng.lucidaconsole.box.tr

Warning: No shape table file present: shapetable
Failed to load unicharset from file unicharset
Building unicharset for training from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Reading eng.lucidaconsole.box.tr ...

Flat shape table summary: Number of shapes = 0 max unichars = 0 number with multiple unichars = 0

Done!

It rebuilt the unicharset I had already created and gave me a 1 KB file containing only this:

1
NULL 0 NULL 0

At this point I don't know what to do. I'm a first-time user of this program, but this doesn't seem right.

Upvotes: 3

Views: 3049

Answers (2)

FlySoFast

Reputation: 1932

If you're using Windows, I think this tool can make the training process much, much easier. I went through a lot of trouble learning how to train Tesseract before I started using it. Just download the latest version and read the user manual, and you will be able to train Tesseract without touching the keyboard!

Upvotes: 0

mlissner

Reputation: 18206

It looks like you need to cluster the character features of the training pages, as described here.

I believe the basic command for this is something like:

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

This appears to be something that was added in version 3.02.
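
For context, here is a rough sketch of how the clustering commands might be chained for the file in the question. It assumes a POSIX shell (on Windows the same three commands can be typed directly at the prompt), and that `font_properties` and the extracted `unicharset` sit in the current directory. The `run` and `cluster_steps` helpers, the `DRY_RUN`, `TR_FILE` and `LANG_PREFIX` variables, and the empty-file check are my own illustrative additions, not part of Tesseract. By default the script only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Sketch of the Tesseract 3.02 clustering steps, in order.
# File names are taken from the question; adjust them to your own data.
TR_FILE="${TR_FILE:-eng.lucidaconsole.box.tr}"
LANG_PREFIX="${LANG_PREFIX:-eng1}"

# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || exit 1
    fi
}

# Hypothetical helper: run the three clustering tools one after another.
cluster_steps() {
    # Assumption: a missing or empty .tr file is worth ruling out first,
    # since the tools would then have no features to cluster.
    if [ "${DRY_RUN:-1}" = "0" ] && [ ! -s "$TR_FILE" ]; then
        echo "error: $TR_FILE is missing or empty" >&2
        return 1
    fi
    run shapeclustering -F font_properties -U unicharset "$TR_FILE"
    run mftraining -F font_properties -U unicharset -O "$LANG_PREFIX.unicharset" "$TR_FILE"
    run cntraining "$TR_FILE"
}

cluster_steps
```

Printing the commands first makes it easy to confirm that each tool is pointed at the same `.tr` file and the same `unicharset` before anything is overwritten.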

Upvotes: 2
