Reputation: 776
I have installed the pytesseract module in my venv and want to extract text from a German image.
I am executing this script from the pytesseract documentation, with the language set to German:
import cv2
import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image
print(pytesseract.image_to_string(Image.open('test.jpg')))
print(pytesseract.image_to_string(Image.open('test.jpg'), lang='ger'))
which gives me
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Error opening data file C:\\Program Files (x86)\\Tesseract-OCR/tessdata/ger.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'ger\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
I have found the language data on [tessdoc/Data-Files](https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md).
So far I have only found a guide for Linux: "How do I install a new language pack for Tesseract on 16.04".
Where do I need to move the language files in my pytesseract site-packages to get the script working?
Upvotes: 8
Views: 21918
Reputation: 1
I just installed the language pack with apt-get, without setting the TESSDATA_PREFIX environment variable, and it works:
apt-get install tesseract-ocr-YOUR_LANG_CODE
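To confirm that a language pack was picked up, one option is to ask Tesseract which languages it can see. The following is a minimal sketch, assuming the `tesseract` executable is on the PATH and that `tesseract --list-langs` prints one header line followed by one language code per line. The helper names `parse_list_langs` and `installed_languages` are made up for this illustration.

```python
import subprocess


def parse_list_langs(output):
    """Return the language codes from `tesseract --list-langs` output.

    Assumes the first line is a header and every later non-empty
    line is a single language code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines[1:] if line]


def installed_languages(tesseract_cmd="tesseract"):
    """Run Tesseract and return the languages it reports as available."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some releases write the list to stderr, so fall back to it
    # when stdout is empty.
    return parse_list_langs(result.stdout or result.stderr)
```

If the code passed as `lang=` does not appear in the list returned by `installed_languages()`, pytesseract will most likely fail with the same "Failed loading language" error as in the question.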
Upvotes: 0
Reputation: 21
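Best way I've found:
1. Install Tesseract with tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe
2. Download the .traineddata file for your language, e.g. fas.traineddata
3. Copy it into the tessdata folder of the Tesseract-OCR installation location, some location like: C:\Program Files\Tesseract-OCR\tessdata
4. Use the traineddata name for the language. For Farsi, I use lang='fas'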
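Placing a downloaded language file in the tessdata folder can also be scripted. This is only a sketch: `install_traineddata` and `has_language` are hypothetical helpers, the paths are whatever your installation uses, and writing under C:\Program Files usually needs administrator rights. Note that the `lang` value must match the file name, so German, whose data file is deu.traineddata, is requested with lang='deu' rather than lang='ger'.

```python
import shutil
from pathlib import Path


def install_traineddata(downloaded_file, tessdata_dir):
    """Copy a downloaded .traineddata file into the tessdata directory."""
    src = Path(downloaded_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {src.name}")
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {dest_dir}")
    # shutil.copy2 returns the path of the newly created file
    return Path(shutil.copy2(src, dest_dir))


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in the tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

A typical call would look like `install_traineddata("Downloads/deu.traineddata", r"C:\Program Files\Tesseract-OCR\tessdata")`, followed by `has_language(..., "deu")` as a quick check before running the OCR.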
Upvotes: 2
Reputation: 780
There are two ways.
1. Install the language pack with apt-get:
apt-get install tesseract-ocr-YOUR_LANG_CODE
For example, in my case it was Bengali, so I installed:
apt-get install tesseract-ocr-ben
or, to install all languages:
apt-get install tesseract-ocr-all
This worked for me in an Ubuntu environment.
2. Set the environment variable TESSDATA_PREFIX so that it points to the language pack. You can download the language pack from here: https://github.com/tesseract-ocr/tessdata
Once you have downloaded the data pack, you can also set the environment variable programmatically:
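import os
# Assign through os.environ rather than os.putenv: putenv does not update
# os.environ, which is the mapping pytesseract passes to the tesseract process.
os.environ['TESSDATA_PREFIX'] = 'path/to/your/tessdata/file'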
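The environment variable approach can be wrapped so that a wrong path fails early instead of surfacing as a TesseractError. This is a sketch under assumptions: `use_tessdata` and `ocr` are hypothetical helper names, and which directory TESSDATA_PREFIX must name seems to differ between Tesseract versions (the 3.05 error message in the question asks for the parent of tessdata), so the check accepts either layout.

```python
import os
from pathlib import Path


def use_tessdata(prefix, lang):
    """Set TESSDATA_PREFIX to `prefix` after checking the language file exists.

    Looks for <lang>.traineddata directly in `prefix` and in
    `prefix/tessdata`, and returns the path that was found.
    """
    prefix = Path(prefix)
    candidates = [
        prefix / f"{lang}.traineddata",
        prefix / "tessdata" / f"{lang}.traineddata",
    ]
    for candidate in candidates:
        if candidate.is_file():
            # Must happen before pytesseract launches the tesseract process
            os.environ["TESSDATA_PREFIX"] = str(prefix)
            return candidate
    raise FileNotFoundError(f"{lang}.traineddata not found under {prefix}")


def ocr(image_path, lang):
    """Run OCR on an image with the given language code."""
    # Imported lazily so the path check above works without pytesseract
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Usage would be along the lines of `use_tessdata('path/to/your/tessdata', 'ben')` followed by `ocr('test.jpg', 'ben')`.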
Upvotes: 3
Reputation: 776
I found a guide for this on a German site: "Python Texterkennung: Bild zu Text mit PyTesseract in Windows" (Python text recognition: image to text with PyTesseract on Windows).
Upvotes: 0