Reputation: 33
I am working with the Python tesseract package, with sample code like the following:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
From my understanding, the error occurred when reading the file chi-sim.traineddata (which stands for Simplified Chinese). The attempts I have made to solve this problem are described below.
I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the source files are located, since when I call print(pytesseract.get_languages(config = "")) I get a long list of languages printed, including chi-sim.
text = pytesseract.image_to_string(image)
I tried setting TESSDATA_PREFIX in multiple ways, including:
Using the config parameter as in the original code.
Adding a global environment variable in PyCharm.
Adding the following line in the code:
os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
Adding the following line to bash_profile in the terminal:
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
But unfortunately, none of these worked.
I suspected that chi-sim.traineddata is, somehow, broken, so I downloaded the trained data file directly from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata), hit the "Download" button on the right, and placed the downloaded file in both the tesseract-lang directory and the original tesseract directory (where eng.traineddata is located). Yes, I've tried both, but neither works.
With respect to this issue, are there any potential solutions?
Upvotes: 3
Views: 3100
Reputation: 143197
The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
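As a minimal sketch, the name mismatch described above could be caught before calling Tesseract, assuming the language data is stored as <lang>.traineddata files in a single directory. The helper names available_languages and resolve_language are hypothetical (they are not part of pytesseract) and only inspect file names on disk.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def resolve_language(lang, tessdata_dir):
    """Return a language code that matches a file on disk.

    Tries the name as given, then the same name with '-' replaced by '_'.
    Raises FileNotFoundError if neither variant has a .traineddata file.
    """
    installed = set(available_languages(tessdata_dir))
    for candidate in (lang, lang.replace("-", "_")):
        if candidate in installed:
            return candidate
    raise FileNotFoundError(
        f"No {lang}.traineddata in {tessdata_dir}; found: {sorted(installed)}"
    )
```

The returned name could then be passed as the lang argument of pytesseract.image_to_string, for example lang=resolve_language("chi-sim", "/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/"), together with the same --tessdata-dir config used in the question.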
Upvotes: 1