Reputation: 137
I have been using Nutch + Solr (4.3.0) to index a site, with the schema.xml provided by Nutch.
My problem is that when I search for words that occur in my header or menu, Solr returns every page, since those blocks appear on all of them.
What I want is to remove these HTML blocks from the index so that the search doesn't include those 'false positives' so to speak.
I was trying something like:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="HEADER STARTS(.*?)HEADER ENDS" replacement="" />
I applied this to the index analyzer of my content fieldType, where "HEADER STARTS" and "HEADER ENDS" are HTML comments marking the block, but it appears to have no effect at all.
I couldn't find anything better by googling, but I am a real newbie to this tech stack.
Any help would be welcome!
Thanks!!!
Upvotes: 0
Views: 902
Reputation: 15791
You might have a look at boilerpipe. It is a Java library specifically suited to that issue. I used it in a project with good results, but only with plain Lucene. For Solr integration, there is an open issue.
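Separately, one thing that may be worth checking about the charFilter from the question: assuming the pattern is evaluated with java.util.regex semantics, `.` does not match line breaks by default, so a header block that spans several lines would never match. The sketch below is plain Java, not Solr code. The class name, the `strip` helper and the sample page are made up for illustration. It compares the original pattern with a variant that uses the `(?s)` (DOTALL) flag. It also assumes the comment markers reach Solr at all. If Nutch's HTML parser only forwards extracted text, the comments may already be gone before any charFilter runs, and no pattern would help.

```java
import java.util.regex.Pattern;

public class HeaderStripDemo {
    // Mimics a pattern-replace step: every match of the regex is replaced with "".
    static String strip(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical page content with a header block spanning several lines.
        String page = "<!-- HEADER STARTS -->\n<ul>\n<li>Home</li>\n</ul>\n"
                    + "<!-- HEADER ENDS -->\nBody text";

        // Pattern from the question: '.' stops at line breaks, so nothing matches.
        String original = "HEADER STARTS(.*?)HEADER ENDS";
        // Variant with the DOTALL flag, so '.' also matches line breaks.
        String dotall = "(?s)HEADER STARTS.*?HEADER ENDS";

        System.out.println(strip(original, page).contains("Home")); // true: header still present
        System.out.println(strip(dotall, page).contains("Home"));   // false: header removed
    }
}
```

If the second variant behaves as expected on your real content, the same `(?s)` prefix could be tried in the `pattern` attribute of the charFilter, but I have not verified that against Solr 4.3.0.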
Upvotes: 2