Tesseract TessData fonts used for training

Question

I am using Tesseract for OCR in an Android app. I am focusing on Chinese, but I only need to recognise a few keywords, so I was thinking of creating my own .traineddata files using jTessBoxEditor. Which fonts does the Chinese Traditional tessdata file use? https://github.com/tesseract-ocr/tessdata

Alternatively, is there a way to edit the chi_tra.traineddata file so that it only recognises a few keywords? The main reason I am doing this is that the file is 63.4 MB and Tesseract takes around 2 to 3 minutes to finish. The accuracy is great, but it is slow.

thewaywewere · Accepted Answer

The font_properties file covering all of Tesseract's trained languages can be found on GitHub. You can check that list for the fonts specific to Traditional Chinese.
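If it helps to inspect that list programmatically, here is a minimal sketch. It assumes the usual font_properties layout of one font per line, with the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur). The `FontProps` type and `parse_font_properties` function are hypothetical names made up for this illustration, not part of Tesseract.

```python
from typing import List, NamedTuple


class FontProps(NamedTuple):
    """One font_properties entry (hypothetical helper type)."""
    name: str
    italic: bool
    bold: bool
    fixed: bool
    serif: bool
    fraktur: bool


def parse_font_properties(text: str) -> List[FontProps]:
    """Parse font_properties text, assuming '<name> <5 flags of 0 or 1>' per line."""
    fonts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so the name is kept whole and the
        # last five fields are treated as the flags.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # not enough fields, skip
        name, *flags = parts
        if any(flag not in ("0", "1") for flag in flags):
            continue  # malformed flags, skip
        fonts.append(FontProps(name, *(flag == "1" for flag in flags)))
    return fonts


# Usage (the path is an assumption about where you saved the file):
# with open("font_properties", encoding="utf-8") as fh:
#     for font in parse_font_properties(fh.read()):
#         print(font.name, "bold" if font.bold else "regular")
```

The file itself does not say which language a font was used for, so you would still need to pick out the Traditional Chinese fonts by name.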

In the tesseract-ocr/langdata repository on GitHub, you can check the chi_tra.wordlist file inside the chi_tra folder to find the words used for training.
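To see whether your keywords already appear in that word list, something like the following sketch may help. It assumes the word list is a UTF-8 text file with one word per line. The `missing_keywords` function and the example keywords are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List


def missing_keywords(wordlist_text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that do not appear as a line in the word list.

    Assumes one word per line; blank lines are ignored.
    """
    known = {line.strip() for line in wordlist_text.splitlines() if line.strip()}
    return [word for word in keywords if word not in known]


# Usage (the path and the keywords are assumptions for illustration):
# with open("chi_tra/chi_tra.wordlist", encoding="utf-8") as fh:
#     print(missing_keywords(fh.read(), ["出口", "入口"]))
```

Any keyword reported as missing was not in the training word list, which may be a reason to include it in your own training data.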
