Reputation: 143
I'm extracting content from Microsoft Word 97-07 documents (.doc) and storing it in a field in Solr (in order to show context snippets for highlighting). It seems like the extracted content is not properly filtered; lots of special characters are stored, whereas I only want to store plain text. When I print out the snippets, they look like this:
Is there any way to filter out or strip these special characters? It would also be nice, though not necessary, to remove the text that turns out to be Word field codes as well, such as NUMPAGES.
I have the following ExtractingRequestHandler that I use:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
The RequestHandler is used via SolrJ, with these parameters:
up.setParam("fmap.content", "file_content");
up.setParam("fmap.title", "title_text");
and the file_content field is defined like this:
<field name="file_content" type="text_printable" stored="true"/>
and although I don't think the field type matters (because the field is not indexed), I will include it here anyway:
<fieldType name="text_printable" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
Edit: I forgot to mention that I'm using Solr 4.4.0, which ships with Tika 1.4.
Upvotes: 2
Views: 502
Reputation: 143
It turns out this is partially fixed in Tika 1.5.
I say partially fixed because there are still some special characters related to dynamic page numbering in Tables of Contents.
According to the nice people on #solr on Freenode, Apache Tika 1.5 is supposed to be packaged with Solr 4.8.0. As a temporary fix until 4.8.0 is released, I simply downloaded Tika 1.5 and put tika-core-1.5.jar and tika-parsers-1.5.jar in the contrib/extraction/lib directory of Solr. I also had to delete the old files, namely tika-core-1.4.jar and tika-parsers-1.4.jar. It seems to work flawlessly so far.
Upvotes: 1