Reputation: 1254
I'm configuring Jackrabbit 2.3.6 and I need to index binary files (PDF, ODT). I've configured SearchIndex in repository.xml according to http://wiki.apache.org/jackrabbit/Search, but when I insert a file into the repository and run a full-text search, no results are returned.
Then I noticed this warning in the logs:
SearchIndex.java:2087 The textFilterClasses configuration parameter has been deprecated, and the configured value will be ignored: org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
How should I configure SearchIndex to index binary data? This is what I'm doing now, which is deprecated and, according to the warning above, doesn't work:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index"/>
  <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor"/>
  <param name="supportHighlighting" value="true"/>
</SearchIndex>
Thanks for any replies.
Upvotes: 2
Views: 685
Reputation: 955
You don't need to do anything to turn Tika parsing on. As long as you set the jcr:mimeType property, Jackrabbit will automatically parse and index the content of the document (provided the format is supported by the version of Tika that your Jackrabbit release uses).
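To illustrate, here is a minimal sketch of the SearchIndex element from the question with the deprecated textFilterClasses parameter simply removed. It assumes, as stated above, that text extraction is handled by Tika without any extra parameter; the other two parameters are copied unchanged from the question.

```xml
<!-- Sketch: the question's SearchIndex element without the deprecated,
     ignored textFilterClasses parameter. Text extraction is assumed to be
     driven by the jcr:mimeType property of each stored file. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index"/>
  <param name="supportHighlighting" value="true"/>
</SearchIndex>
```

With this in place, the part that matters is on the content side: the node that holds the jcr:data binary (typically the jcr:content child of an nt:file node) should carry a matching jcr:mimeType, for example application/pdf for PDF or application/vnd.oasis.opendocument.text for ODT.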
Hope this helps someone. Jackrabbit documentation is really sparse, and the fact that Apache Oak seems set to replace it doesn't help either.
Upvotes: 0
Reputation: 352
This is the answer to a similar question, from Mark Herman on the Jackrabbit Users mailing list:
I'm not an expert, but what I do know is that JR uses Tika to extract text, and it determines how based on the jcr:mimeType property. If you don't supply the mimetype, then it won't know how to extract it (although I wouldn't recommend that as a practice). I believe there is a way to supply JR with a Tika config that might give you what you want. EDIT: There isn't. It's hardcoded.
Additionally, you can specify an indexing config in the repository/workspace XML files, where you can set rules on what gets indexed by Lucene and how.
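As a rough sketch of what such an indexing config can look like: the file below follows the format described in Jackrabbit's indexing configuration documentation. The file name, the nt:unstructured rule, and the property name Text are illustrative assumptions, not something taken from the original poster's setup.

```xml
<?xml version="1.0"?>
<!-- Hypothetical indexing_configuration.xml. It would be referenced from the
     SearchIndex element with a parameter along these lines:
     <param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/> -->
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- For nodes of type nt:unstructured, index the (illustrative) Text property,
       boosted so that matches on it rank higher -->
  <index-rule nodeType="nt:unstructured">
    <property boost="2.0">Text</property>
  </index-rule>
</configuration>
```

Be careful with index rules: as far as I understand the format, once a rule matches a node type, only the properties it lists are indexed for those nodes, so an overly narrow rule can itself cause empty full-text results.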
Upvotes: 1