Henry

Reputation: 411

Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

I am trying to use pytesseract on Jupyter Notebook.

When I run the following code:

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en', config = tessdata_dir_config))

I get the following error:

TesseractError                            Traceback (most recent call last)
<ipython-input-37-c1dcbc33cde4> in <module>()
     11 # tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
     12 
---> 13 print(pytesseract.image_to_string(Image.open('Multi_page24bpp.tif'), lang='en'))
     14 # print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

C:\Users\cpcho\AppData\Local\Continuum\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in image_to_string(image, lang, boxes, config)
    123         if status:
    124             errors = get_errors(error_string)
--> 125             raise TesseractError(status, errors)
    126         f = open(output_file_name, 'rb')
    127         try:

TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata')

I found these two references helpful but I am missing something: https://github.com/madmaze/pytesseract/issues/50 https://github.com/madmaze/pytesseract/issues/64

Thank you for your time on this!

Upvotes: 5

Views: 20090

Answers (4)

On day 1 everything worked; on day 2 I got this error, while on a second computer everything still worked. Five hours later I found the fix myself:

Copy 'eng.traineddata' from "C:\Program Files\Tesseract-OCR\tessdata" to "C:\Program Files\Tesseract-OCR".

That worked for me.
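The copy step can also be done from Python. This is only a minimal sketch: `copy_traineddata` is a hypothetical helper name (not part of pytesseract), and the directories you pass in are assumed to match your own install. Writing into Program Files may need an elevated prompt.

```python
import shutil
from pathlib import Path


def copy_traineddata(tessdata_dir, install_dir, lang="eng"):
    """Copy <lang>.traineddata from tessdata_dir into install_dir.

    Hypothetical helper for illustration; returns the path of the copy.
    """
    source = Path(tessdata_dir) / f"{lang}.traineddata"
    if not source.is_file():
        # Fail early with a clear message instead of a Tesseract error later.
        raise FileNotFoundError(f"No trained data file at {source}")
    target = Path(install_dir) / source.name
    shutil.copy2(source, target)
    return target
```

For the paths in this answer, the call would be `copy_traineddata(r"C:\Program Files\Tesseract-OCR\tessdata", r"C:\Program Files\Tesseract-OCR")`.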

Upvotes: -2

sam

Reputation: 2311

If you don't want to set an environment variable, you can pass the tessdata directory as an argument instead.

For example:

First, do your imports

    import pytesseract
    from PIL import Image

And now configure pytesseract

    pytesseract.pytesseract.tesseract_cmd = "C:/path_to_your_tesseract.exe"
    tessdata_dir_config = '--tessdata-dir "C:/path_to_your_tessdata_folder"'

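    # 'image' is a PIL image opened from your own file (placeholder path)
    image = Image.open("path_to_your_image_file")
    pytesseract.image_to_string(image, config=tessdata_dir_config)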

Upvotes: 3

YingTai.Z

Reputation: 21

I faced the same problem. I tried every solution I could find on Google, without success. Finally, I solved the problem by replacing

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

with

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

Upvotes: 2

thewaywewere

Reputation: 8626

From your post, I see two possible issues.

  1. All the trained language data should be saved in the directory that the Windows environment variable TESSDATA_PREFIX points to, which is C:\Program Files (x86)\Tesseract-OCR\tessdata in your case.

  2. The Tesseract trained English data file is named eng.traineddata unless you renamed it, so the language code to pass is 'eng', not 'en'. Refer to the Tesseract Data Files documentation for more information.

In addition, when opening the image with Image.open() for pytesseract, you may need to give the full file path (e.g. 'z:\\path\\to\\image') if the image file cannot be found.
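The points above can be checked before calling Tesseract at all. What follows is only a sketch under assumptions: `tessdata_config`, `missing_language_files` and `ocr_image` are hypothetical helper names, and the only pytesseract call used is `image_to_string` with its `lang` and `config` arguments, as in the question.

```python
import os


def tessdata_config(tessdata_dir):
    """Build the --tessdata-dir option string for pytesseract's config argument."""
    return f'--tessdata-dir "{tessdata_dir}"'


def missing_language_files(tessdata_dir, langs):
    """Return the language codes that have no <code>.traineddata in tessdata_dir."""
    return [
        code for code in langs
        if not os.path.isfile(os.path.join(tessdata_dir, f"{code}.traineddata"))
    ]


def ocr_image(image_path, tessdata_dir, lang="eng"):
    """Run OCR after checking that the image and the language data exist."""
    # Imported here so the helpers above work without pytesseract installed.
    import pytesseract
    from PIL import Image

    # Resolve to a full path so a wrong working directory is easy to spot.
    image_path = os.path.abspath(image_path)
    if not os.path.isfile(image_path):
        raise FileNotFoundError(image_path)

    # Tesseract accepts several languages joined with '+', e.g. 'eng+fra'.
    missing = missing_language_files(tessdata_dir, lang.split("+"))
    if missing:
        raise FileNotFoundError(f"No traineddata for {missing} in {tessdata_dir}")

    return pytesseract.image_to_string(
        Image.open(image_path), lang=lang, config=tessdata_config(tessdata_dir)
    )
```

With the question's setup, `missing_language_files` would report 'en' as missing while 'eng' passes, which matches the error message about en.traineddata.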

Hope this helps.

Upvotes: 1
