Zach Dean

Reputation: 11

Storing Raw HTML Files in Solr

I have Solr 5.4.1 and I am trying to index and store HTML files. I would like to store the raw HTML so that I can use it for highlighting.

Is there any way to do this? My update/extract request handler uses Tika, which I believe is stripping the HTML tags from my files, so I would like to avoid this and store the raw HTML content instead.

Thanks in advance

Upvotes: 0

Views: 1152

Answers (1)

Matt Pearce

Reputation: 484

The easiest way to search HTML content in Solr is to index it with a field type that uses the HTMLStripCharFilterFactory. This strips HTML tags (including attributes) from the text at index time, meaning you can search the text without also searching the tags. Analysis only affects the indexed terms, so the stored version of the field will still have the HTML tags included.

<!-- Field type for HTML fields: strips HTML markup at index time only; the stored value is unchanged -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
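
To use this type, a field has to be declared with it and marked as stored, since highlighting works from the stored value. A minimal sketch, assuming a field called `content_html` (the name is only an example, not something from your schema):

```xml
<!-- Hypothetical field name; use whatever field your documents put the HTML in. -->
<!-- indexed="true": the tag-stripped text is searchable. -->
<!-- stored="true": the original HTML is kept and returned, which highlighting needs. -->
<field name="content_html" type="text_html" indexed="true" stored="true"/>
```

The raw HTML still has to reach this field unmodified, so whatever sends the document to Solr must pass the markup through rather than the extracted text.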

However, highlighting the stored HTML can break your markup: the highlight tags may be inserted in the middle of an HTML tag, or a snippet may be cut off before its closing tags. An alternative solution is to strip the HTML before storing it in Solr, so that the stored text contains no tags to break.
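
If you go the stripping route and would rather not do it in client code, one option inside Solr is an update processor chain in solrconfig.xml. This is only a sketch under assumptions: the chain name `strip-html` and the field name `content_text` are made up for illustration, and I have not tested it against 5.4.1 specifically.

```xml
<!-- Hypothetical chain name and field name; adjust both to your setup. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Removes HTML markup from the incoming value, so the stored text is plain. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content_text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Must come last: this processor actually executes the update. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be selected by adding `update.chain=strip-html` to the update request. With the markup gone before storage, the highlighter has no tags to split, at the cost of losing the original HTML in that field.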

Upvotes: 3
