Anemonee

Reputation: 33

Pytesseract Failed loading language 'chi-sim'

I am working with the Python pytesseract package, using sample code like the following:

import pytesseract
from PIL import Image

tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)

And I received the following error message:

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

From my understanding, the error occurs when Tesseract tries to read the file chi-sim.traineddata (the data for Simplified Chinese). Below are the attempts I have made to solve this problem.

First, I listed the languages that pytesseract can see:

print(pytesseract.get_languages(config = ""))

I get a long list of languages printed, including chi-sim.

text = pytesseract.image_to_string(image)

  1. Using the config parameter as in the original code.

  2. Adding a global environment variable in PyCharm.

  3. Adding the following line to the code:

os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"

  4. Adding the following line to bash_profile in the terminal:

export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/

Unfortunately, none of these worked.

Are there any potential solutions to this issue?

Upvotes: 3

Views: 3100

Answers (1)

furas

Reputation: 143197

The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.

If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
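
To avoid guessing between the two spellings, you could check which .traineddata file is actually present before calling pytesseract. The following is only a rough sketch: resolve_lang is a hypothetical helper written for this answer (it is not part of pytesseract), and it assumes the language files sit directly inside the tessdata directory you pass in.

```python
from pathlib import Path


def resolve_lang(requested: str, tessdata_dir: str) -> str:
    """Return the spelling of `requested` that has a .traineddata file.

    Hypothetical helper for illustration; not a pytesseract function.
    """
    # Collect language codes from file names such as chi_sim.traineddata.
    available = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}

    # Try the name as given, then the underscore and hyphen variants.
    candidates = (
        requested,
        requested.replace("-", "_"),
        requested.replace("_", "-"),
    )
    for candidate in candidates:
        if candidate in available:
            return candidate

    raise FileNotFoundError(
        f"No traineddata file for {requested!r} in {tessdata_dir}; "
        f"found: {sorted(available)}"
    )
```

With the directory from the question, resolve_lang("chi-sim", "/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/") should return "chi_sim" if only chi_sim.traineddata is installed there, and the result can then be passed as the lang argument of pytesseract.image_to_string.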

Upvotes: 1
