Reputation: 63
I just built Tika from the GitHub repository and tried to OCR a PDF that contains scanned document pages.
java -cp tika-app/target/tika-app-1.17-SNAPSHOT.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample_scanned.pdf
However, only metadata gets extracted, even though I got confirmation beforehand that Tesseract is installed and in use:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Note: Regular PDFs containing plain text get extracted successfully. The problem seems to be the OCR process itself.
This has been tested on CentOS as well as Ubuntu, with the same result.
Do I need to make changes to config files, specify more parsers? What could cause this?
Thank you.
Upvotes: 0
Views: 2751
Reputation: 715
Turns out PDF image extraction is disabled by default. From the PDFParserConfig documentation:
Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution. The default is false.
A simple example to enable it that worked for me:
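import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
ParseContext parseContext = new ParseContext();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true); // off by default; exposes embedded page images to the image parsers
parseContext.set(PDFParserConfig.class, pdfConfig);
// ClasspathUtil is not a Tika class (presumably a local helper); any InputStream over the PDF works here.
try (InputStream stream = ClasspathUtil.readStreamFromClasspath("test.pdf")) {
    parser.parse(stream, handler, new Metadata(), parseContext);
    System.out.println(handler.toString());
}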
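Since the question runs Tika from the command line rather than from Java, the same switch can likely be set through a Tika config file instead. The following is a minimal sketch, assuming a Tika version whose PDFParser accepts parameters from the config file (to my knowledge 1.15 and later); the file name tika-config.xml is arbitrary, and the parser-exclude entry is only there to avoid loading the PDF parser twice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load all default parsers except the PDF parser, which is configured separately below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Same setting as pdfConfig.setExtractInlineImages(true) in the Java example -->
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The file would then be passed to the CLI with the --config option, for example: java -jar tika-app/target/tika-app-1.17-SNAPSHOT.jar --config=tika-config.xml /tmp/testing/sample_scanned.pdf. The memory warning quoted above applies here as well, so it may be worth trying this on a small document first.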
Upvotes: 1