Reputation: 1609
I am using Tesseract for OCR in an Android app. I am focusing on Chinese, but I only need to recognise a few keywords, so I was thinking of creating my own .traineddata files using jTessBoxEditor. Which fonts was the Chinese Traditional tessdata file trained with? https://github.com/tesseract-ocr/tessdata
Alternatively, is there a way to edit the chi_tra.traineddata file so that it only recognises a few keywords? The main reason I am asking is that the file is 63.4 MB and Tesseract takes around 2 to 3 minutes to finish. The accuracy is great, but it is slow.
Upvotes: 1
Views: 1518
Reputation: 8626
The font_properties file covering all of Tesseract's trained languages can be found on GitHub. You can check that list for the fonts supported for Traditional Chinese.
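If it helps to scan that list programmatically, here is a minimal Python sketch. It assumes each font_properties line has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with a font name that contains no spaces and five 0/1 flags. `FontProps` and `parse_font_properties` are hypothetical names chosen for illustration, and the font name in the usage comment is made up.

```python
from typing import List, NamedTuple


class FontProps(NamedTuple):
    """One font_properties entry (assumed layout: name plus five 0/1 flags)."""
    name: str
    italic: bool
    bold: bool
    fixed: bool
    serif: bool
    fraktur: bool


def parse_font_properties(text: str) -> List[FontProps]:
    """Parse font_properties content, skipping lines that do not have 6 fields."""
    fonts = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # blank or malformed line under the assumed format
        name, *flags = parts
        fonts.append(FontProps(name, *(flag == "1" for flag in flags)))
    return fonts


# Usage with made-up content; a real file would be read with
# open(path, encoding="utf-8").read()
# parse_font_properties("Example_Font 0 1 0 0 0\n")
```

The file does not say which language a font belongs to, so you would still have to pick out the Chinese fonts by name.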
In the tesseract-ocr/langdata repository on GitHub, you can check the chi_tra.wordlist file inside the chi_tra folder to find the words used for training.
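To see quickly whether your few keywords already appear in that word list, something like the following Python sketch could work. It assumes the word list is UTF-8 text with one entry per line. `keywords_in_wordlist` is a hypothetical helper, and the keywords shown are placeholders only.

```python
from typing import Dict, Iterable


def keywords_in_wordlist(wordlist_text: str, keywords: Iterable[str]) -> Dict[str, bool]:
    """Map each keyword to True if it appears as a whole line in the word list."""
    # Assumption: one word per line; surrounding whitespace is ignored.
    words = {line.strip() for line in wordlist_text.splitlines() if line.strip()}
    return {kw: kw in words for kw in keywords}


# Usage (the path and keywords are placeholders):
# with open("chi_tra.wordlist", encoding="utf-8") as f:
#     print(keywords_in_wordlist(f.read(), ["電話", "地址"]))
```

This only tells you whether a keyword is listed; it does not change recognition speed or the size of the .traineddata file.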
Upvotes: 1