Reputation: 27
UPDATE
*I have reinstalled Tesseract into my 'Program Files (x86)' folder, and now when I run tesseract --version it responds with the version rather than saying it isn't recognized as a cmdlet.*
This seems to be a pretty common problem, and I have been trying different ways to make this program work. I know there are a lot of existing questions similar to mine, but since none of the methods I have found work, I am hoping to get some fresh ideas. TIA
HERE IS THE EXACT ERROR MESSAGE:
"pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"
AND HERE IS THE CODE I CURRENTLY AM USING:
from pdf2image import convert_from_path
import pytesseract
images = convert_from_path("CHECK_12-01-22.pdf", 500, poppler_path=r'C:\Program Files\poppler-23.01.0\Library\bin')
for i, image in enumerate(images):
    fname = 'image' + str(i) + '.png'
    image.save(fname, "PNG")
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
text = pytesseract.image_to_string(image, lang='eng')
# text = pytesseract.image_to_string(image, lang='eng', config='--tessdata-dir "C:\\Program Files\\Tesseract-OCR\\tessdata"')
I am using Windows 11 and PyCharm.
I have Poppler working; it converts my PDF to images. But when I try to run Tesseract, it says no languages were found. I have tried a few different methods to get it working. First, my environment variables are set. [image of environment variable path]
Then I tried using config in my code.
text = pytesseract.image_to_string(image, lang='eng', config='--tessdata-dir "C:\\Program Files\\Tesseract-OCR\\tessdata"')
which also didn't work. I've downloaded different language data files and put them in the tessdata folder to no avail.
Upvotes: 2
Views: 2390
Reputation: 27
Here is the solution I was able to find:
# Raw string, with the tessdata path wrapped in one matched pair of double quotes
tessdata_dir_config = r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
text = pytesseract.image_to_string(image, lang='eng', config=tessdata_dir_config)
I was initially using a 32-bit version while I have a 64-bit operating system, so I uninstalled Tesseract-OCR, found the x86 version, and reinstalled that to my Program Files (x86) folder. I had to define the tessdata path before setting tesseract_cmd. I put the path into a variable that I could pass as the config argument when converting the image to text.
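A minimal sketch of the same idea, in case it helps someone check their setup before calling Tesseract. The helper name tessdata_config is hypothetical (it is not part of pytesseract), and it assumes the tessdata folder sits next to tesseract.exe, as it does in my install. It only builds the config string and fails early if the language file is missing.

```python
from pathlib import Path


def tessdata_config(tesseract_exe, lang="eng"):
    """Build a --tessdata-dir config string for the tessdata folder next to tesseract_exe.

    Hypothetical helper: assumes tessdata lives beside the executable.
    Raises FileNotFoundError if <lang>.traineddata is not there.
    """
    tessdata = Path(tesseract_exe).parent / "tessdata"
    model = tessdata / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"Missing language file: {model}")
    # Double quotes keep a path with spaces together as one argument
    return f'--tessdata-dir "{tessdata}"'


# Intended use (commented out so the sketch runs without pytesseract installed):
# import pytesseract
# exe = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
# pytesseract.pytesseract.tesseract_cmd = exe
# text = pytesseract.image_to_string(image, lang='eng', config=tessdata_config(exe))
```

If the helper raises, the problem is the folder contents or the install location rather than the Python code.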
Upvotes: 0
Reputation: 3476
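Have you set the system environment variable correctly? Check with the command below (this is Unix shell syntax; on Windows use echo %TESSDATA_PREFIX% in cmd or echo $env:TESSDATA_PREFIX in PowerShell):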
echo $TESSDATA_PREFIX
In my environment the system variable is under:
You should see the eng files in this directory:
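The same check can be done from Python, which is handy when running inside PyCharm. This is only a sketch: list_traineddata is a hypothetical helper, and it looks both in the folder the variable names and in a tessdata subfolder of it, on the assumption that some setups point the variable at the parent directory (as the error message in the question asks) and others at tessdata itself.

```python
import os
from pathlib import Path


def list_traineddata(prefix=None):
    """Return sorted language names found under TESSDATA_PREFIX (or an explicit prefix).

    Returns None if no prefix is given and the variable is not set.
    """
    prefix = prefix or os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    root = Path(prefix)
    langs = set()
    # Assumption: the variable may name tessdata itself or its parent folder
    for folder in (root, root / "tessdata"):
        if folder.is_dir():
            langs.update(p.stem for p in folder.glob("*.traineddata"))
    return sorted(langs)


print(list_traineddata())
```

If this prints None, the variable is not visible to the Python process; if it prints an empty list, the variable is set but no .traineddata files were found there.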
Upvotes: 0
Reputation: 11850
When starting a Tesseract application, tesseract.exe needs to be able to find the tessdata folder.
There are many ways to do that. For a specific case such as MuPDF, I may set it in a batch file, as the first command:
set TESSDATA_PREFIX=C:\Apps\PDF\mupdf\mupdf-1.21.0-windows-tesseract\mupdf-1.21.0-windows-tesseract\tessdata
Or I may have a fallback preset in the user environment, pointing to a folder where I keep a copy of eng.traineddata (22.4 MB, 17/01/2023, 01:16:15).
For that to stick, both now and in future, it sometimes needs a log-out and log-in before the next command shell picks it up.
In the above case on Windows 10 I did NOT need to log out; the variable is available to fresh command shells. But beware of shells started before the change, such as some file commanders: they need stopping and restarting.
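For a Python script, a rough equivalent of the batch-file approach is to set the variable inside the running process before anything launches Tesseract. The sketch below only demonstrates the inheritance: a child process started after the assignment sees the new value, which is the same reason shells started before a change do not. The path is the one from the question's reinstall and is only an example, child_sees is a hypothetical helper, and I am assuming pytesseract starts tesseract.exe as a child process that inherits this environment.

```python
import os
import subprocess
import sys


def child_sees(name):
    """Start a fresh child process and return the value it sees for variable `name`."""
    code = f"import os; print(os.environ.get({name!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example path only; adjust it to wherever your tessdata folder lives
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"

# The child was started after the change, so it inherits the new value
print(child_sees("TESSDATA_PREFIX"))
```

This only affects the current process and its children; it does not change the user or system environment, so nothing needs restarting.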
Upvotes: 1