user164863

Reputation: 641

Apache Tika App configuration file

I'm using the Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.

The [Apache Tika website][1] says the following:

Build artifacts

The Tika build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.

tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

So I downloaded the latest version (1.18) of tika-app-*.jar. It is just a single file.

Running it from the command line as java -jar tika-app-1.18.jar -t <filename> prints the file content I need, but every run also emits two warnings:

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.

I don't know whether those warnings slow things down, but it is hard to follow the rest of the output among the repetitive warnings.

I tried pointing Tika to my own configuration file with:

java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>

My tika-config.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

With that config I get the error No protocol: filename.doc, and the warnings still appear.

How can I exclude the JPEG and SQLite parsers?

Upvotes: 6

Views: 7985

Answers (1)

aarkerio

Reputation: 2364

My solution was this tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader loadErrorHandler="IGNORE"/>
  <service-loader initializableProblemHandler="ignore"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

and then set:

export TIKA_CONFIG=/path/to/tika-config.xml

in my .bashrc file.
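
For anyone who wants to try this in one go, here is a rough sketch of the steps as a shell script. The config content is copied from the file above; the directory under $HOME/.config/tika and the variable names are my own assumptions for illustration, so use whatever path suits you. The java call only runs if the jar is in the current directory and a file name is passed as the first argument.

```shell
#!/bin/sh
# Sketch only: the config directory is a hypothetical choice, not a Tika convention.
config_dir="${HOME:-/tmp}/.config/tika"
mkdir -p "$config_dir"

# Write the configuration from the answer to disk.
# The quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$config_dir/tika-config.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader loadErrorHandler="IGNORE"/>
  <service-loader initializableProblemHandler="ignore"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>
EOF

# Tika looks at the TIKA_CONFIG environment variable when no --config option is given.
export TIKA_CONFIG="$config_dir/tika-config.xml"

# Run the extraction only when the jar (assumed to be in the current
# directory) and an input file argument are both present.
jar="tika-app-1.18.jar"
input="${1:-}"
if [ -f "$jar" ] && [ -n "$input" ]; then
    java -jar "$jar" -t "$input"
fi
```

To make the setting permanent, put the export line in .bashrc as described above; the script only sets it for the current shell session.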

Upvotes: 3
