Dunski

Reputation: 672

How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it runs perfectly, except that the OCR part is not working. I have Tesseract installed and it also works properly. When I send a PDF with an image in it, I get the following warning:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the poms, and rebuild? I would really rather not do that.

There is some info here on how to use the command line utility and the TikaConfig but I cannot figure out how to enable TesseractOCRParser with it.

Any help is greatly appreciated.

Upvotes: 3

Views: 4822

Answers (3)

MatthewFord

Reputation: 2926

I would recommend setting ocrStrategy to auto.

This tries regular text extraction first and then falls back to OCR.
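As a rough sketch of how that option could be set without rebuilding anything, it can go in a custom config file passed with --config. This assumes a Tika release whose PDFParser accepts the auto value for ocrStrategy (older releases may not). The file names tika-config-auto.xml and image.pdf are placeholders, and excluding PDFParser from DefaultParser is one way to make sure only the configured instance handles PDFs.

```shell
#!/bin/sh
# Write a minimal Tika config that sets the PDF parser's OCR strategy.
# Assumption: the installed Tika version supports ocrStrategy=auto.
cat > tika-config-auto.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all default parsers, but hand PDFs to the configured one below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">auto</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

# Run Tika only if the jar and the sample PDF are actually present.
if [ -f tika-app.jar ] && [ -f image.pdf ]; then
  java -jar tika-app.jar --config=tika-config-auto.xml --text image.pdf
fi
```

If the value is rejected by your version, ocr_and_text_extraction is an alternative strategy to try.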

Upvotes: 0

SsshirazzZ

Reputation: 21

I tried user3250052's approach, but I was unable to recompress the jar file in a way that was executable. That's owing to my own inexperience with Java, but regardless, the less hacky way is to pass a custom Tika config file when calling Tika:

java -jar tika-app.jar --config=tika-config.xml image.pdf

This is what my tika-config.xml looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

To build that config file, first I ran:

java -jar tika-app.jar --dump-current-config

That dumps the default config. I put the output into tika-config.xml and added:

<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="extractInlineImages" type="bool">true</param>
  </params>
</parser>

which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).

Even though tesseract is enabled by default (so OCR will work out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, "by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time".

Now everything (OCR on image files, OCR of images in or image-based PDFs, and also naturally text extraction of text-based PDFs) works with the java app tika. I found plenty of documentation on getting this to work on the java server tika but very little on the java app tika, so I'm hoping this saves someone the few hours it took me to figure that out (let me know).

Upvotes: 2

Dunski

Reputation: 672

OK, so with the help of this post on the Apache Tika forum (thank you, guys), I managed to get it working.

It's a hack, but it works. I extracted the tika-app jar file, located PDFParser.properties, and changed the following properties like this:

extractInlineImages true 
extractUniqueInlineImagesOnly false 
ocrStrategy ocr_and_text_extraction

Then locate TesseractOCRConfig.properties and change this one property to 1:

enableImageProcessing=1

Save both properties files and zip everything up again. The new jar file will then extract both the regular text and the text inside images from a PDF file.
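For anyone who struggles with the re-zipping step, here is a hedged sketch of the same hack using jar uf, which updates the named entries inside a copy of the jar instead of rebuilding the whole archive. The paths of the two properties files inside the jar and the output name tika-app-ocr.jar are assumptions (check them with unzip -l tika-app.jar), and set_prop is a small hypothetical helper written for this example, not part of Tika.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE SEP
# Replace the line that sets KEY, or append it if it is missing.
# SEP is the separator the file uses (" " or "=").
set_prop() {
  file=$1 key=$2 value=$3 sep=$4
  if grep -q "^${key}[ =]" "$file"; then
    sed "s|^${key}[ =].*|${key}${sep}${value}|" "$file" > "$file.tmp" \
      && mv "$file.tmp" "$file"
  else
    printf '%s%s%s\n' "$key" "$sep" "$value" >> "$file"
  fi
}

# Assumed locations inside the jar; verify before relying on them.
PDF_PROPS=org/apache/tika/parser/pdf/PDFParser.properties
OCR_PROPS=org/apache/tika/parser/ocr/TesseractOCRConfig.properties

if [ -f tika-app.jar ]; then
  # Work on a copy so the original jar stays intact.
  cp tika-app.jar tika-app-ocr.jar
  # Extract only the two properties files, keeping their directory layout.
  unzip -o tika-app-ocr.jar "$PDF_PROPS" "$OCR_PROPS"

  set_prop "$PDF_PROPS" extractInlineImages true " "
  set_prop "$PDF_PROPS" extractUniqueInlineImagesOnly false " "
  set_prop "$PDF_PROPS" ocrStrategy ocr_and_text_extraction " "
  set_prop "$OCR_PROPS" enableImageProcessing 1 "="

  # Write the edited files back into the copied jar.
  jar uf tika-app-ocr.jar "$PDF_PROPS" "$OCR_PROPS"
fi
```

The patched copy can then be run in place of the original, for example java -jar tika-app-ocr.jar image.pdf.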

Upvotes: 3
