Reputation: 21

How can I add a new font to Tesseract 4.0?

I'm writing a text recognition program and I want to train Tesseract 4.0 to recognize a specific font (in Hebrew). How can I do it?

I tried "trainyourtesseract.com" (which didn't work at all) and "jTessBoxEditor" (which I couldn't figure out how to use properly).

I would love to get some help with that issue. Thanks.

Upvotes: 2

Answers (2)

Thusitha Deepal

Reputation: 1546

For a detailed walkthrough, watch this video: https://www.youtube.com/watch?v=N5Y6gZgvryQ

Here is a shell script for custom Tesseract training:

N=3 # number of images

# Image naming convention: languagename.fontname.expN.filetype

# Step 1: Make the box files

for i in `seq 1 $N`
do
tesseract testlan.arial.exp$i.png testlan.arial.exp$i batch.nochop makebox
done

# After manually editing the box files, do the following steps.

# Step 2: Create the .tr files (combining each image file with its box file)

# Step 3: Extract the charset from the box files (the output of this command is the unicharset file)

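for i in `seq 1 $N`
do
tesseract testlan.arial.exp$i.png testlan.arial.exp$i box.train
done
# Run the extractor once over all box files; calling it per file would overwrite unicharset each time
unicharset_extractor testlan.arial.exp*.box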

# Step 4: Create a font_properties file based on our needs.

# Line format: [fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]

echo "arial 0 0 1 0 0" > font_properties

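# Steps 5 and 6: Train on the data (mftraining, then cntraining).
# Both tools take all the .tr files in a single call; running them once per file would overwrite the output of the previous run.

mftraining -F font_properties -U unicharset -O testlan.unicharset testlan.arial.exp*.tr
cntraining testlan.arial.exp*.tr

# After steps 5 and 6, the shapetable, inttemp, pffmtable, and normproto files are created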

# Step 7: Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)

 mv inttemp testlan.inttemp
 mv normproto testlan.normproto
 mv pffmtable testlan.pffmtable
 mv shapetable testlan.shapetable

combine_tessdata testlan.

# Move testlan.traineddata to C:\Program Files\Tesseract-OCR\tessdata
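
Since the script depends on hand-edited box files, a quick sanity check before the box.train step can save a failed training run. The helper below is only a sketch: check_box is a made-up name, and it assumes the usual box file layout of one glyph per line with six space-separated fields (symbol, left, bottom, right, top, page).

```shell
#!/bin/sh
# check_box FILE (hypothetical helper): report lines that do not have the
# six expected fields (symbol left bottom right top page) or whose
# coordinate/page fields are not non-negative integers.
# Returns non-zero if any problem was found.
check_box() {
  awk '
    NF != 6 {
      printf "%s:%d: expected 6 fields, got %d\n", FILENAME, FNR, NF
      bad = 1
      next
    }
    {
      for (f = 2; f <= 6; f++)
        if ($f !~ /^[0-9]+$/) {
          printf "%s:%d: field %d is not an integer\n", FILENAME, FNR, f
          bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example usage with the file names from the script above:
# for f in testlan.arial.exp*.box; do check_box "$f" || echo "fix $f first"; done
```

It only checks the shape of each line; whether the boxes actually match the glyphs still has to be verified by eye in a box editor.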

Upvotes: 0

tehtea

Reputation: 326

Did you try reading this guide? https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining The rough approach is to prepare your own language files (most importantly your own .training_text file), then run tesstrain.sh to generate the dataset. After that, run combine_tessdata to extract the .lstm file from the original Hebrew model, and pass it to the lstmtraining tool to fine-tune the original model on your new font.

UPDATE: the documentation link has changed: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00
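
The steps described above can be sketched roughly as follows, modelled on the commands in the linked tutorial. Treat it as an outline rather than a tested recipe: the font name, the directory names, and the iteration count are placeholders, and it assumes you have the langdata files plus the float ("best") heb.traineddata, since the integer models cannot be fine-tuned. The run wrapper is a hypothetical helper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of fine-tuning the Hebrew LSTM model on a new font.
# Every path and the font name below are placeholders (assumptions).
LANG_CODE=heb
FONT="My Hebrew Font"        # hypothetical font name, as installed on the system
TESSDATA=./tessdata_best     # assumed to contain the float heb.traineddata
LANGDATA=./langdata          # assumed to contain heb/heb.training_text etc.
OUT=./heb_finetune           # hypothetical working directory

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run mkdir -p "$OUT"

# 1. Render training data for the new font from the training text.
run tesstrain.sh --fonts_dir /usr/share/fonts --lang "$LANG_CODE" \
  --linedata_only --noextract_font_properties \
  --langdata_dir "$LANGDATA" --tessdata_dir "$TESSDATA" \
  --fontlist "$FONT" --output_dir "$OUT/train"

# 2. Extract the LSTM component from the original Hebrew model.
run combine_tessdata -e "$TESSDATA/$LANG_CODE.traineddata" "$OUT/$LANG_CODE.lstm"

# 3. Continue training from that component using the new data.
run lstmtraining --model_output "$OUT/finetuned" \
  --continue_from "$OUT/$LANG_CODE.lstm" \
  --traineddata "$TESSDATA/$LANG_CODE.traineddata" \
  --train_listfile "$OUT/train/$LANG_CODE.training_files.txt" \
  --max_iterations 400

# 4. Convert the final checkpoint into a usable .traineddata file.
run lstmtraining --stop_training \
  --continue_from "$OUT/finetuned_checkpoint" \
  --traineddata "$TESSDATA/$LANG_CODE.traineddata" \
  --model_output "$OUT/$LANG_CODE.traineddata"
```

Running it as is just prints the four commands, so you can review the paths before executing anything with DRY_RUN=0.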