kurochenko

Reputation: 1254

Jackrabbit deprecated SearchIndex textFilterClasses attribute

I'm configuring Jackrabbit 2.3.6 and need to index binary files (PDF, ODT). I configured SearchIndex in repository.xml according to http://wiki.apache.org/jackrabbit/Search, but when I insert a file into the repository and run a full-text search, no results are returned.

Then I noticed this warning in the logs:

SearchIndex.java:2087 The textFilterClasses configuration parameter has been deprecated, and the configured value will be ignored: org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor

How should I configure SearchIndex to index binary data? This is my current configuration, which uses the deprecated parameter and, according to the warning above, has no effect:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index"/>
    <param name="textFilterClasses" value="org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor"/>
    <param name="supportHighlighting" value="true"/>
</SearchIndex>

Thanks for any replies.

Upvotes: 2

Views: 685

Answers (2)

Ravish Bhagdev

Reputation: 955

You don't need to do anything to turn Tika parsing on. As long as you set the jcr:mimeType property, Jackrabbit will automatically parse and index the content of the document (provided the format is supported by the version of Tika bundled with your Jackrabbit release).
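
Since the warning quoted in the question says the textFilterClasses value is ignored, a minimal sketch of the SearchIndex element would simply drop that parameter. This assumes the path and supportHighlighting values from the question are otherwise what you want; treat it as an illustration rather than a verified configuration:

```xml
<!-- Sketch: the question's SearchIndex with the deprecated, ignored
     textFilterClasses parameter removed. Text extraction is then left
     to the built-in Tika integration, driven by jcr:mimeType. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index"/>
    <param name="supportHighlighting" value="true"/>
</SearchIndex>
```

Files stored before the MIME type was set may need to be re-saved or the index rebuilt before they show up in full-text results.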

Hope this helps someone. The Jackrabbit documentation is really sparse, and the fact that Apache Oak seems set to replace it doesn't help either.

Upvotes: 0

RobSis

Reputation: 352

This is the answer to a similar question, from Mark Herman on the Jackrabbit Users mailing list:

I'm not an expert, but what I do know is that JR uses Tika to extract text, and it determines how based on the jcr:mimeType property. If you don't supply a MIME type, it won't know how to extract the text (although I wouldn't recommend that as a practice). I believe there is a way to supply JR with a Tika config that might give you what you want. EDIT: There isn't. It's hardcoded.

Additionally, you can specify an indexing configuration in the repository/workspace XML files to set rules on what gets indexed by Lucene, and how.
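
As a rough sketch of what such an indexing configuration can look like: the file is referenced from the SearchIndex element through a parameter such as `<param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>`. The file name, the nt:unstructured node type, and the Text property below are illustrative assumptions, not values from the question, and the DTD version may differ for your Jackrabbit release:

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<!-- Hypothetical indexing_configuration.xml: for nodes of type
     nt:unstructured, index only the property named "Text". -->
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
```

Rules like this control which properties are indexed; they do not change how Tika extracts text from binaries.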

Upvotes: 1
