Tika detects Tesseract but doesn't perform any OCR

Question

I just installed Tika from its GitHub repository and tried to OCR a PDF containing scanned document pages.

java -cp tika-app/target/tika-app-1.17-SNAPSHOT.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample_scanned.pdf

However, only metadata gets extracted, although I got confirmation beforehand that Tesseract is installed and will be used:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

(Full output)

Note: Regular PDFs containing plain text get extracted successfully. The problem seems to be the OCR process itself.

This has been tested on CentOS as well as Ubuntu, with the same issue on both.

Do I need to make changes to config files or specify more parsers? What could cause this?

Thank you.
