EGN

Reputation: 2572

Which version of Tesseract to use for training a new language?

I'm seeking advice on which version of Tesseract I should use to train an ancient language that has unique letters. The language is very similar to Arabic in its characteristics. It is written right-to-left, and some letters connect within a word. In other words, a letter might have three shapes depending on whether it comes at the beginning, middle, or end. It also has harakat (short vowel marks) that appear above or below letters.

I'm asking because I want to take advantage of the tools available for version 3.X, but this warning about Arabic threw me off, since this language is very similar to it.

For anyone who's familiar with Tesseract, which version do you recommend for training such a language? Also, if you are aware of a better tool, please share it.

Upvotes: 0

Views: 727

Answers (1)

thewaywewere

Reputation: 8626

If you have a large number of documents to OCR, I would recommend Tesseract 4.0, as it's faster in general. You may refer to the links below for more information in case you haven't read them already.

  1. Tesseract 4.0 Accuracy and Performance
  2. Tesseract 4.0 with LSTM
  3. Training Tesseract 4.0
  4. Language Data File for 4.0. You can run a test to see whether Arabic OCR works well in OCR Engine Mode 1 (i.e. --oem 1), which is neural nets LSTM only.
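
As a rough sketch of the test suggested in item 4, the snippet below builds and runs a Tesseract command line with the LSTM-only engine. It assumes the `tesseract` 4.0 executable is on your PATH and that the Arabic traineddata (`ara`) is installed. The function names and the image name `sample_arabic.png` are hypothetical, used only for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="ara", oem=1):
    """Build the argument list for a Tesseract CLI call.

    --oem 1 selects the neural nets LSTM only engine;
    "ara" is the language code of the Arabic traineddata file.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


def run_ocr(image_path, output_base, lang="ara", oem=1):
    """Run Tesseract if it is installed and return the path of the text output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, oem)
    subprocess.run(command, check=True)
    # Tesseract appends ".txt" to the output base name by default.
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical input image; replace with a real scan of Arabic text.
    print(" ".join(build_tesseract_command("sample_arabic.png", "out")))
```

If the Arabic results look reasonable on a few sample pages, that is a hint (not a guarantee) that the LSTM engine can cope with a similar right-to-left connected script.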

Tesseract 4.0.0 alpha has been available since last Nov/Dec.

Hope this helps.

Upvotes: 2
