oak

Reputation: 137

How do I ignore some HTML parts (such as header, menu, footer) from my Solr index?

I have been using Nutch + Solr (4.3.0) to index a site, with the schema.xml provided by Nutch.

My problem is that when I search for words that occur in my header or menu, Solr returns every page, since those blocks appear on all of them.

What I want is to exclude these HTML blocks from the index so that searches don't return those false positives.

I was trying something like:

<charFilter class="solr.PatternReplaceCharFilterFactory" 
      pattern="HEADER STARTS(.*?)HEADER ENDS" replacement="" />

I applied it to the index analyzer of my content fieldType, where "HEADER STARTS" and "HEADER ENDS" are HTML comments in the page, but it appears to have no effect at all.

I couldn't find anything better by googling, but I am a real newbie to this tech stack.
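
One possible reason the pattern never matches, sketched in plain Java because PatternReplaceCharFilterFactory takes a java.util.regex pattern: by default `.` does not match line breaks, so a header that spans several lines is left untouched unless the pattern starts with `(?s)`. The class name, the `strip` helper, and the sample page are hypothetical, made up for illustration. This only tests the regex itself. It assumes the comment markers actually reach Solr, which may not be the case if Nutch drops HTML comments while parsing.

```java
import java.util.regex.Pattern;

class HeaderStripDemo {
    // Hypothetical helper: replaces every match of the regex with an empty
    // string, roughly what a pattern-replace char filter does to its input.
    static String strip(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample page with a header that spans several lines.
        String page = "<!-- HEADER STARTS -->\nHome | About\n<!-- HEADER ENDS -->\nReal page content";

        // Pattern from the question: '.' stops at line breaks, so nothing matches.
        String original = "HEADER STARTS(.*?)HEADER ENDS";
        // Same pattern with the DOTALL flag inlined, so '.' also matches line breaks.
        String dotall = "(?s)HEADER STARTS(.*?)HEADER ENDS";

        System.out.println("original keeps menu text: " + strip(original, page).contains("Home"));
        System.out.println("dotall keeps menu text:   " + strip(dotall, page).contains("Home"));
    }
}
```

If the second variant removes the menu text here but the Solr field still contains it, the markers are probably already gone before the char filter runs, and the filtering would have to happen on the Nutch side instead.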

Any help would be welcome!

Thanks!!!

Upvotes: 0

Views: 902

Answers (2)

Persimmonium

Reputation: 15791

You might have a look at boilerpipe. It is a Java library specifically suited to that problem. I used it in a project with good results, but with plain Lucene. For Solr integration, there is an open issue.

Upvotes: 2

Jayendra

Reputation: 52799

NUTCH-585, which has been committed and should be available in trunk and in the latest Nutch version, should meet your needs.

Upvotes: 1
