henrythewasp

Reputation: 43

Is there a way to turn off parsing of embedded docs in the tika-server?

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 as an HTTP endpoint. Our application sends files to it (mostly Office, PDF and RTF) and gets plain-text renditions back by setting the "Accept: text/plain" header on each request.

Since Tika 1.15, the default behaviour is to "extract all embedded documents" (TIKA-2096).

I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.

Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?

An answer to tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check if it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.

I have looked at Configuring Tika, but it makes no mention of embedded docs.

Upvotes: 4

Views: 1723

Answers (1)

Dave Meikle

Reputation: 266

The answers in tika-parser-exclude-pdf-attachments are excellent if you are calling Tika from code.

Until now there has been no way to do this for embedded files in Tika Server, other than disabling a whole file type using EmptyParser with something like the below:

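<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Keep the default parsers, but stop them handling these types -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime-exclude>image/jpeg</mime-exclude>
            <mime-exclude>application/zip</mime-exclude>
        </parser>
        <!-- Route the excluded types to EmptyParser, which extracts nothing -->
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>image/jpeg</mime>
            <mime>application/zip</mime>
        </parser>
    </parsers>
</properties>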

This has become a common request, so I've added a feature, coming in Tika 1.25 (yet to be released), that lets you skip embedded files using a header setting:

curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/html" --header "X-Tika-Skip-Embedded: true"

Any parser using the EmbeddedDocumentExtractor will honour this.
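
If the request comes from application code rather than curl, the same headers can be set there. Below is a minimal Python sketch using only the standard library. The helper names (build_tika_request, extract_text) are made up for illustration, the URL is the one from the curl example, and the header only has an effect on a server that includes the feature (1.25 or later, per the above). The explicit generic Content-Type is there on the assumption that the server should be left to detect the file type itself.

```python
import urllib.request

# Server address taken from the curl example above; adjust for your setup.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document, skip_embedded=True, url=TIKA_URL):
    """Build (but do not send) a PUT request asking tika-server for plain text."""
    headers = {
        "Accept": "text/plain",
        # Assumption: a generic type lets the server detect the format itself;
        # without it urllib adds a form-encoded Content-Type to requests with a body.
        "Content-Type": "application/octet-stream",
    }
    if skip_embedded:
        # Header name as given in the answer above.
        headers["X-Tika-Skip-Embedded"] = "true"
    return urllib.request.Request(url, data=document, headers=headers, method="PUT")


def extract_text(path, skip_embedded=True):
    """Send the file at `path` to tika-server and return the text rendition."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), skip_embedded)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling extract_text("test_recursive_embedded.docx") against a running server should then return the main document text only, while skip_embedded=False keeps the default behaviour of including embedded documents.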

Upvotes: 3
