Reputation: 6485
I have a PDF (or any other type of file, such as .doc or .ppt) that contains text as well as images. How can I extract the images from such files using Tika?
Can I also run OCR on the extracted images using Tess4j or any other library?
This is how I call Tika:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(writeLimit);
Metadata metadata = new Metadata();
InputStream stream = new FileInputStream("file.pdf");
parser.parse(stream, handler, metadata);
p.s. I have tika-app.jar.
Upvotes: 2
Views: 7563
Reputation: 198
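Register a TesseractOCRConfig and a PDFParserConfig in the ParseContext, and register the parser itself so that embedded images are parsed (and run through OCR) recursively: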
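import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

InputStream stream = new FileInputStream(inputFile);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
// Inline images in PDFs are skipped by default; enable them so they reach the OCR parser.
pdfConfig.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser itself so that embedded documents are parsed recursively.
parseContext.set(Parser.class, parser);
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata, parseContext);
String text = handler.toString().trim();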
1) Install Tesseract using 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and add its installation directory (on Windows it is installed under Program Files) to the PATH environment variable. Restart Windows if needed. After that, pass any (yes, any!) file and its text will be extracted.
2) Download tess4j-3.0.0.jar from https://sourceforge.net/projects/tess4j/?source=typ_redirect and reference it as a Maven dependency:
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.0.0</version>
</dependency>
Then add these dependencies:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.5</version>
</dependency>
<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <version>1.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>4.2.2</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.11</version>
</dependency>
On Ubuntu, install Tesseract with apt-get instead (the package is tesseract-ocr); the rest of the setup is the same.
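Because this setup depends on the tesseract executable being reachable through PATH, it can help to verify that before debugging anything on the Tika side. The following is only a minimal sketch using the Java standard library: the class name TesseractPathCheck and the helper findOnPath are made up for illustration and are not part of Tika or Tess4j. It also assumes the executable is named tesseract (or tesseract.exe on Windows).

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class TesseractPathCheck {

    // Hypothetical helper: scans a PATH-style string for an executable
    // named "tesseract" (Linux/macOS) or "tesseract.exe" (Windows).
    static Optional<File> findOnPath(String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        List<String> names = Arrays.asList("tesseract", "tesseract.exe");
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<File> exe = findOnPath(System.getenv("PATH"));
        if (exe.isPresent()) {
            System.out.println("Found tesseract at: " + exe.get().getAbsolutePath());
        } else {
            System.out.println("tesseract was not found on PATH; OCR will probably not run.");
        }
    }
}
```

If the check reports that tesseract is missing even though it is installed, the PATH seen by the JVM is probably stale; restarting the IDE or the shell (or Windows, as mentioned above) usually picks up the new value.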
Upvotes: 4