Reputation: 207
I am running the example that ships with the Apache Solr 3.6 download (using its solr.jar), and HTML tags are not being stripped from my field values.
In schema.xml I added the following:
<!-- A text field that strips HTML markup before tokenizing -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="title" type="text_html" indexed="true" stored="true" multiValued="true"/>
I then posted the following JSON to Solr:
[
  {
    "id" : "978-064172344522",
    "title":"my <a href=\"www.foo.bar\">link</a> power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth"
  }
]
After restarting Solr, I searched for power-shot, and the result still shows the HTML tags:
<result name="response" numFound="1" start="0" maxScore="0.13561106">
<doc>
<float name="score">0.13561106</float>
<str name="id">978-064172344522</str>
<arr name="title">
<str>my <a href="www.foo.bar">link</a> power-shot PowerShot USC Utility <br>hello</br>
What is missing here?
Upvotes: 0
Views: 336
Reputation: 15789
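What you see in the response is the stored value of the field, which Solr returns exactly as it was originally sent. The analyzer chain, including the HTMLStripCharFilterFactory, only affects the terms that are indexed. If you search for title:href, for example, you should not find the document, because the HTML markup is removed during analysis.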
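To make the stored-versus-indexed distinction concrete, here is a toy Python model. It is not Solr code: strip_html and index_document are hypothetical helpers, the standard-library HTMLParser is only a rough stand-in for HTMLStripCharFilterFactory, and the regex split only approximates StandardTokenizerFactory.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards tags (rough stand-in for the char filter)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of raw with markup removed (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def index_document(raw):
    """Model one field with indexed="true" and stored="true".

    The stored value is kept verbatim; only the indexed terms pass through
    the analysis chain (HTML stripping, then a crude word split).
    """
    stored = raw
    indexed_terms = set(re.findall(r"\w+", strip_html(raw)))
    return stored, indexed_terms


if __name__ == "__main__":
    title = 'my <a href="www.foo.bar">link</a> power-shot <br>hello</br>'
    stored, terms = index_document(title)
    print("stored value:", stored)
    print("matches 'href':", "href" in terms)
    print("matches 'power':", "power" in terms)
```

In this model, a lookup for href finds nothing because the term was never indexed, while the stored value still contains the original markup. If the markup should not appear in results either, it would have to be removed before the document is sent to Solr.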
Upvotes: 2