Reputation: 143
I'm extracting content from Microsoft Word 97-07 documents (.doc) and storing it in a field in Solr (in order to show context snippets for highlighting). It seems like the extracted content is not properly filtered; lots of special characters are stored, whereas I only want to store plain text. When I print out the snippets, they look like this:
Is there any way to filter out or strip these special characters? It would also be nice, though not necessary, to remove the text that turns out to be Word field codes as well, such as NUMPAGES.
I have the following ExtractingRequestHandler that I use:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
The RequestHandler is used via SolrJ, with these parameters:
up.setParam("fmap.content", "file_content");
up.setParam("fmap.title", "title_text");
and the file_content field is defined like this:
<field name="file_content" type="text_printable" stored="true"/>
and although I don't think the field type matters (because the field is not indexed), I will include it here anyway:
<fieldType name="text_printable" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
Edit: I forgot to mention that I'm using Solr 4.4.0, which ships with Tika 1.4.
Upvotes: 2
Views: 502
Reputation: 143
It turns out this is partially fixed in Tika 1.5.
I say partially fixed because there are still some special characters related to dynamic page numbering in Tables of Contents.
According to the nice people on #solr on Freenode, Apache Tika 1.5 is supposed to be packaged with Solr 4.8.0. As a temporary fix until 4.8.0 is released, I simply downloaded Tika 1.5 and put tika-core-1.5.jar and tika-parsers-1.5.jar in the contrib/extraction/lib directory of Solr. I also had to delete the old files, namely tika-core-1.4.jar and tika-parsers-1.4.jar. It seems to work flawlessly so far.
Upvotes: 1