Plug4

Reputation: 3938

Python 2.7: Difficulty using pypdfocr for Windows 7

I am trying to use pypdfocr in Windows 7 with Python 2.7.

This is the error message I get when I run pypdfocr from cmd:

C:\Users\chamar.stu>pypdfocr F:\test2.pdf
Starting conversion of F:\test2.pdf
'pdfimages' is not recognized as an internal or external command,
operable program or batch file.
WARNING: Could not execute pdfimages to calculate DPI (try installing xpdf or poppler?), so defaulting to 300dpi
Traceback (most recent call last):
  File "c:\users\chamar.stu\appdata\local\continuum\anaconda2\lib\runpy.py", line 174, in _run_module_as_main
... .... ....

pypdfocr\pypdfocr_tesseract.py", line 98, in _is_version_uptodate
    ver = [int(x) for x in ver_str.split('.')]
ValueError: invalid literal for int() with base 10: '00alpha'

It seems that I am missing Poppler or Xpdf, but I did install Poppler via PyGObject as suggested here. I've also added xpdf to my PATH environment variable as suggested here.

Any suggestions to get me out of this little mess?

Upvotes: 2

Views: 394

Answers (2)

Eduard Florinescu

Reputation: 17541

Try downgrading Tesseract from version 4.0.0-beta.1 (in my case) to a 3.x release whose version string contains no letters.

tesseract --version #to check

The version check built into the pypdfocr package expects every dot-separated component of the version string to be an integer, hence the error on '00alpha' ('0-beta' in my case).
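
To illustrate why the check fails, here is a minimal sketch in Python. The first expression is the one shown in the traceback; the version string "4.00.00alpha" is an assumption (the traceback only reveals the failing piece '00alpha'), and parse_version is a hypothetical helper, not part of pypdfocr.

```python
import re

# Assumed version string; only the trailing '00alpha' is confirmed by the traceback.
ver_str = "4.00.00alpha"

try:
    # Same expression as in the traceback: every piece must be a pure integer.
    ver = [int(x) for x in ver_str.split('.')]
except ValueError as exc:
    print("strict parse failed: %s" % exc)


def parse_version(ver_str):
    """Hypothetical tolerant parser: keep the leading digits of each
    dot-separated piece and stop at the first piece carrying a suffix."""
    parts = []
    for piece in ver_str.strip().split('.'):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):
            # Piece such as '00alpha' or '0-beta': keep the number, ignore the rest.
            break
    return parts


print(parse_version("3.05.01"))       # [3, 5, 1]
print(parse_version("4.00.00alpha"))  # [4, 0, 0]
print(parse_version("4.0.0-beta.1"))  # [4, 0, 0]
```

Patching an installed package this way is fragile, so downgrading Tesseract remains the simpler route.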

Upvotes: 0

Roland Smith

Reputation: 43533

The pypdfocr script is probably calling the pdfimages program (one of the poppler utilities, not the library) using the subprocess module.

I could not easily discern if the utilities were provided in the URI you mention.

If not, you can find pre-built MS Windows executables for the utilities, e.g. here.

Make sure that the directory where the poppler utilities are installed is in your PATH, so that pypdfocr can find them.
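
A quick way to check this from Python is to walk the PATH entries yourself. The sketch below is only an illustration: find_on_path is a hypothetical helper, and the ("", ".exe") extension list is an assumption that covers the usual Windows case rather than the full PATHEXT logic.

```python
import os


def find_on_path(program, search_path=None, extensions=("", ".exe")):
    """Return the full path of the first file named program (optionally with
    one of the given extensions) found in search_path, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        for ext in extensions:
            # PATH entries on Windows are sometimes wrapped in quotes.
            candidate = os.path.join(directory.strip('"'), program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    for tool in ("pdfimages", "tesseract"):
        print("%s -> %s" % (tool, find_on_path(tool) or "NOT FOUND on PATH"))
```

If pdfimages is reported as not found, open a new cmd window after editing PATH, since windows that are already open keep the old value.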

Upvotes: 1
