Reputation: 641
I'm using the Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.
The [Apache Tika website][1] says the following:
Build artifacts
The Tika build consists of a number of components and produces the following main binaries:
tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.
tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
So I downloaded the latest version (1.18) of tika-app-*.jar, which is just a single file.
Running this in a command line like java -jar tika-app-1.18.jar -t <filename>
gives me the file content I need, but each time I also get two warnings:
July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.
I don't know whether those warnings slow things down, but it is hard to follow the other output among those repetitive warnings.
I have tried to point Tika to my own configuration file by:
java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>
My tika-config.xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/x-sqlite3</mime-exclude>
<parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
</parser>
</parsers>
</properties>
If I use that config, I get the error No protocol: filename.doc
and the warnings still appear.
How can I exclude the JPEG and SQLite parsers?
Upvotes: 6
Views: 7985
Reputation: 2364
My solution was this tika-config.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader loadErrorHandler="IGNORE"/>
<service-loader initializableProblemHandler="ignore"/>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/x-sqlite3</mime-exclude>
<parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
</parser>
</parsers>
</properties>
and then set:
export TIKA_CONFIG=/path/to/tika-config.xml
in my .bashrc file.
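To try this out before touching .bashrc, the steps above can be sketched as a short shell session. It writes the same config to a file, exports TIKA_CONFIG for the current shell only, and runs the app if the jar is present. The config location under /tmp and the input file sample.doc are placeholders I made up for illustration, and the jar is assumed to sit in the current directory.

```shell
# Hypothetical location for the config file; adjust to taste.
config_path="${TMPDIR:-/tmp}/tika-config.xml"

# Write the config from this answer verbatim.
cat > "$config_path" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader loadErrorHandler="IGNORE"/>
<service-loader initializableProblemHandler="ignore"/>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/x-sqlite3</mime-exclude>
<parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
</parser>
</parsers>
</properties>
EOF

# Export for this shell session only (no .bashrc change yet).
export TIKA_CONFIG="$config_path"

# Run Tika only if the jar and a test document exist here.
# sample.doc is a placeholder name, not a file from the question.
if [ -f tika-app-1.18.jar ] && [ -f sample.doc ]; then
    java -jar tika-app-1.18.jar -t sample.doc
fi
```

If the warnings are gone in this session, moving the export line into .bashrc makes the setting permanent, as described above.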
Upvotes: 3