zecheng zhao

Reputation: 69

Query Solr without HTML tags?

I have sent to Solr the following data:

{
    "id":"kkk",
    "name":"<div>book</div>"
}

After Solr receives the data, a search for "div" returns nothing, but a search for "book" returns the document. What should I do? Here is my schema:

<field name="name" type="text_html" indexed="true" stored="true"/>

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
    </analyzer>
</fieldType>

Solr seems to strip the HTML tags only when it builds the index. If I send the data to Solr directly, how can I strip the HTML tags?

Upvotes: 0

Views: 832

Answers (1)

freedev

Reputation: 30107

What you see in the name field in the results of your Solr query is not what Solr actually indexes.

The <charFilter class="solr.HTMLStripCharFilterFactory"/> char filter removes the HTML tags during analysis, before the text is tokenized.

The content is indexed by Lucene only after all the char filters, tokenizers, and filters have run.

Have a look at the Solr Admin Analysis tool to better understand what's going on.

In conclusion, each field has two kinds of content:

  • the stored content (stored="true"), which is the source text passed to the indexer and is what is returned to the user when a document matches the query constraints.
  • the indexed content (indexed="true"), which is the source content after it has been processed by the char filters, tokenizer, and filters, and is what is used for the information retrieval part.
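
To make the stored/indexed distinction concrete, here is a rough Python model of the text_html analysis chain from the question. It is only an approximation for illustration, not Solr's actual implementation: the regex is a crude stand-in for HTMLStripCharFilterFactory, and STOPWORDS is an assumed sample rather than the real lang/stopwords_en.txt.

```python
import re

# Assumed sample only; the real list lives in lang/stopwords_en.txt.
STOPWORDS = {"a", "an", "the", "of"}


def analyze(value):
    """Toy model of the text_html analyzer: returns the tokens that get indexed."""
    # Crude stand-in for HTMLStripCharFilterFactory: drop anything that looks like a tag.
    text = re.sub(r"<[^>]+>", " ", value)
    # WhitespaceTokenizerFactory: split on whitespace.
    tokens = text.split()
    # LowerCaseFilterFactory: lowercase every token.
    tokens = [token.lower() for token in tokens]
    # StopFilterFactory: drop stop words.
    return [token for token in tokens if token not in STOPWORDS]


stored = "<div>book</div>"   # what a query returns (stored="true")
indexed = analyze(stored)    # what a query is matched against (indexed="true")

print("stored: ", stored)
print("indexed:", indexed)
print("'div' searchable: ", "div" in indexed)
print("'book' searchable:", "book" in indexed)
```

In this model the stored value keeps its markup while the only indexed token is "book", which matches the behaviour described in the question.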

AFAIK, the analysis chain cannot modify the stored (source) content: as said, it is the original value of the field. So if you want the stored value to be free of HTML, strip the tags yourself before submitting the document to Solr.
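
A minimal sketch of preparing the document on the client side, assuming the documents are built in Python before being posted to Solr. It uses only the standard library's html.parser; the names strip_html and prepare_doc are hypothetical helpers for this example, and the choice of "name" as the HTML field comes from the question.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(value):
    """Return value with HTML tags removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return parser.text()


def prepare_doc(doc, html_fields=("name",)):
    """Return a copy of doc with the given fields stripped of HTML."""
    cleaned = dict(doc)
    for field in html_fields:
        if isinstance(cleaned.get(field), str):
            cleaned[field] = strip_html(cleaned[field])
    return cleaned


doc = {"id": "kkk", "name": "<div>book</div>"}
print(prepare_doc(doc))  # prints {'id': 'kkk', 'name': 'book'}
```

The cleaned dictionary is what you would then send to Solr, so that the stored value no longer contains the markup.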

Upvotes: 2
