Reputation: 69
I have sent the following data to Solr:
{
"id":"kkk",
"name":"<div>book</div>"
}
After Solr receives the data, searching for "div" returns no results, but searching for "book" returns the document. What can I do? Here is my schema:
<field name="name" type="text_html" indexed="true" stored="true"/>
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
</analyzer>
</fieldType>
Solr only strips the HTML tags when it indexes the field. If I want to send the data to Solr directly, how can I strip the HTML tags?
Upvotes: 0
Views: 832
Reputation: 30107
What you see in the name field in the result of your Solr query is not what is actually indexed by Solr.
The <charFilter class="solr.HTMLStripCharFilterFactory"/> char filter removes the HTML tags, which is why a search for "div" finds nothing while a search for "book" matches.
The content is indexed by Lucene only after the char filter, the tokenizer, and all the filters have run.
Have a look at the Solr Admin Analysis tool to better understand what is going on.
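To make the difference concrete, here is a rough Python imitation of the analyzer chain from the question. It is only a sketch: the regex stands in for HTMLStripCharFilterFactory and does not reproduce its exact behaviour, and STOPWORDS is a made-up stand-in for lang/stopwords_en.txt.

```python
import re

# Hypothetical stand-in for lang/stopwords_en.txt; the real file has more entries.
STOPWORDS = {"a", "an", "the", "of"}


def analyze(text):
    """Rough imitation of the text_html analyzer chain (not Solr's actual code)."""
    no_tags = re.sub(r"<[^>]*>", " ", text)   # stands in for HTMLStripCharFilterFactory
    tokens = no_tags.split()                  # WhitespaceTokenizerFactory
    tokens = [t.lower() for t in tokens]      # LowerCaseFilterFactory
    return [t for t in tokens if t not in STOPWORDS]  # StopFilterFactory


stored = "<div>book</div>"   # what Solr returns in search results, unchanged
indexed = analyze(stored)    # what queries are matched against

print(stored)   # <div>book</div>
print(indexed)  # ['book']
```

Since "div" never reaches the index, a query for it cannot match, even though the stored value still shows the tags.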
In conclusion, for each field there are two contents:
- the stored content (stored="true"), which is the source text passed to the index and is what is returned to the user when a document matches the query constraints.
- the indexed content (indexed="true"), which is the source content after it has been processed by the tokenizer and filters, and which is then used for the information retrieval part.
AFAIK, there is no way to modify the stored (source) content once it has been submitted. As said, this is the source of the field, so if you want to modify it, prepare it before submitting it to Solr.
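As a minimal sketch of preparing the source before submitting it, the tags can be stripped on the client side. The example below uses only the Python standard library; strip_html and _TextExtractor are names made up for this illustration, and the document is the one from the question.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps text from adjacent elements apart;
    # it also splits words that are interrupted by inline tags.
    return " ".join(" ".join(parser.parts).split())


doc = {"id": "kkk", "name": strip_html("<div>book</div>")}
payload = json.dumps([doc])
print(payload)  # [{"id": "kkk", "name": "book"}]
```

The resulting JSON can then be sent to Solr as before, and the stored value of name will no longer contain the tags.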
Upvotes: 2