Reputation: 21
I'm writing a text recognition program and I want to train Tesseract 4.0 to recognize a specific font (in Hebrew). How can I do it?
I tried "trainyourtesseract.com" (which didn't work at all) and "jTessBoxEditor" (which I couldn't figure out how to use properly).
I would appreciate any help with this. Thanks.
Upvotes: 2
Views: 6510
Reputation: 1546
For a detailed walkthrough, watch this video: https://www.youtube.com/watch?v=N5Y6gZgvryQ
Here is a shell script for custom Tesseract training. Note that it follows the legacy (non-LSTM) box-file workflow:
N=3 # number of training images
# Image naming convention: languagename.fontname.expN.filetype
# Step 1: Generate a box file for each image
for i in $(seq 1 "$N")
do
  tesseract "testlan.arial.exp$i.png" "testlan.arial.exp$i" batch.nochop makebox
done
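# Step 2: Create a .tr file from each image and its box file
for i in $(seq 1 "$N")
do
  tesseract "testlan.arial.exp$i.png" "testlan.arial.exp$i" box.train
done
# Step 3: Extract the character set from all box files in one call
# (calling unicharset_extractor once per file overwrites unicharset each time)
unicharset_extractor testlan.arial.exp*.box
# Step 4: Font properties, in the order: fontname italic bold fixed serif fraktur
echo "arial 0 0 1 0 0" > font_properties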
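# Steps 5 and 6: Run mftraining and cntraining over all .tr files in one call
# (calling them once per file overwrites their output files each time)
mftraining -F font_properties -U unicharset -O testlan.unicharset testlan.arial.exp*.tr
cntraining testlan.arial.exp*.tr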
# After step 5 and step 6, the shapetable, inttemp, pffmtable and normproto files exist; prefix each with the language name
mv inttemp testlan.inttemp
mv normproto testlan.normproto
mv pffmtable testlan.pffmtable
mv shapetable testlan.shapetable
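# Combine everything into testlan.traineddata (the trailing dot is part of the argument)
combine_tessdata testlan.
# Move testlan.traineddata to the tessdata directory, e.g. C:\Program Files\Tesseract-OCR\tessdata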
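Once testlan.traineddata is in place, recognition with the new model could look roughly like the sketch below. It is not part of the original script: sample.png, the output base name "result" and the TESSDATA_DIR default are placeholders, and `--oem 0` is used on the assumption that you want to explicitly select the legacy engine, which is the one the script above trains. By default the sketch only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Sketch: run OCR with the custom legacy model built by the script above.
# Assumptions (placeholders, not from the original answer):
#   sample.png, the output base "result", and the tessdata location.
LANG_NAME="testlan"
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tessdata}"
IMAGE="${1:-sample.png}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# --tessdata-dir points at the directory holding testlan.traineddata,
# -l selects the model by its language name, --oem 0 selects the legacy engine.
run tesseract "$IMAGE" result --tessdata-dir "$TESSDATA_DIR" -l "$LANG_NAME" --oem 0
```

With DRY_RUN=0 the recognized text would be written to result.txt in the current directory.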
Upvotes: 0
Reputation: 326
Did you try reading this link? https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
The rough approach is to prepare your own language files (most importantly your own .training_text file) and then run tesstrain.sh to generate the dataset. After that, you can run combine_tessdata to extract the .lstm file from the original Hebrew model and pass it to the lstmtraining tool to fine-tune the original model on your new font.
UPDATE: the documentation link has changed: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00
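The steps above could be sketched roughly as follows. This is only an outline under stated assumptions, modeled on the commands in the linked tutorial: the font name, all directories and the iteration count are placeholders I made up, and heb.traineddata is assumed to be the float ("best") model, since the tutorial fine-tunes from those. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of fine-tuning the Hebrew LSTM model on a new font (Tesseract 4).
# Every path, the font name and --max_iterations are placeholder assumptions.
LANG_CODE="heb"
FONT_NAME="${FONT_NAME:-My Hebrew Font}"        # hypothetical font name
FONTS_DIR="${FONTS_DIR:-/usr/share/fonts}"
LANGDATA_DIR="${LANGDATA_DIR:-./langdata}"      # contains heb/heb.training_text
TESSDATA_DIR="${TESSDATA_DIR:-./tessdata_best}" # contains heb.traineddata
WORK_DIR="${WORK_DIR:-./heb_finetune}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode (quoting is not preserved in the
# printed form), otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Render synthetic training data for the new font.
run tesstrain.sh --fonts_dir "$FONTS_DIR" --fontlist "$FONT_NAME" \
  --lang "$LANG_CODE" --linedata_only --noextract_font_properties \
  --langdata_dir "$LANGDATA_DIR" --tessdata_dir "$TESSDATA_DIR" \
  --output_dir "$WORK_DIR/train"

# 2. Extract the LSTM component from the original Hebrew model.
run combine_tessdata -e "$TESSDATA_DIR/$LANG_CODE.traineddata" \
  "$WORK_DIR/$LANG_CODE.lstm"

# 3. Fine-tune, continuing from the extracted LSTM.
run lstmtraining --model_output "$WORK_DIR/finetuned" \
  --continue_from "$WORK_DIR/$LANG_CODE.lstm" \
  --traineddata "$TESSDATA_DIR/$LANG_CODE.traineddata" \
  --train_listfile "$WORK_DIR/train/$LANG_CODE.training_files.txt" \
  --max_iterations 400

# 4. Convert the final checkpoint into a usable .traineddata file.
run lstmtraining --stop_training \
  --continue_from "$WORK_DIR/finetuned_checkpoint" \
  --traineddata "$TESSDATA_DIR/$LANG_CODE.traineddata" \
  --model_output "$WORK_DIR/$LANG_CODE.traineddata"
```

The iteration count in particular is a guess; the tutorial suggests watching the reported error rate rather than relying on a fixed number.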
Upvotes: 2