Alex

Reputation: 829

solr exclude html class from indexing

I'm indexing a knowledge base with Solr. The problem is that the menu is indexed as well, so searching for a term used in the menu returns all pages.
Can I somehow tell Solr to exclude a specific HTML class from indexing?
HTML tags are stripped during indexing, so I can't locate the element afterwards.


EDIT:
I added a short sample for what I want to achieve.
That is, to exclude certain HTML nodes (like my navigation) from being indexed.

Sample html:

<nav>
    <ul>
        <li>topic-1</li>
        <li>topic-2</li>
        <li>topic-3</li>
    </ul>
</nav>
<main>
    <h1>Topic-1</h1>
    <p>Lorem ipsum dolor sit ament...</p>
</main>

What I currently get in my index from that:

topic-1
topic-2
topic-3

Topic-1
lorem ipsum dolor sit ament...

What I want to get in my index from that:

Topic-1
lorem ipsum dolor sit ament...

Upvotes: 2

Views: 940

Answers (3)

Alexandre Rafalovitch

Reputation: 9789

You basically want to remove some of the text. You can do it on the field itself with the PatternReplace character filter (solr.PatternReplaceCharFilterFactory), which sits before the tokenizer in the field type definition. That only changes what gets indexed, though; the stored version of the field will still contain the menu text.
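
As a rough sketch of the char filter approach, the field type could look something like the following. The field type name `text_html_no_nav` is made up for illustration, and the pattern assumes the menu is a single, non-nested `<nav>` element as in the question's sample; a regex will not cope reliably with nested or malformed markup, so treat this as a starting point rather than a tested config.

```xml
<!-- Hypothetical field type name; adjust the pattern to your own markup. -->
<fieldType name="text_html_no_nav" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. Drop the whole nav block while the markup is still present.
         (?s) lets "." match newlines; "<" must be written as &lt; inside an XML attribute. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;nav\b[^&gt;]*&gt;.*?&lt;/nav&gt;"
                replacement=" "/>
    <!-- 2. Strip the remaining HTML tags. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters run in the order they are listed, so the pattern replacement has to come before the HTML stripping; otherwise the tags it matches on are already gone.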

Or you could go earlier in the indexing process and use an UpdateRequestProcessor (URP) to modify the field before it is even looked at for indexing. You'd want the RegexReplace URP (solr.RegexReplaceProcessorFactory) for that.
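
A minimal sketch of such a chain in solrconfig.xml might look like this. The chain name `strip-nav` and the field name `content` are assumptions, and the pattern again assumes a single, non-nested `<nav>` element:

```xml
<!-- Hypothetical chain and field names; select the chain with update.chain=strip-nav
     on the update request, or mark it as the default chain. -->
<updateRequestProcessorChain name="strip-nav">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <!-- "<" is escaped as &lt; because this is XML; (?s) lets "." match newlines. -->
    <str name="pattern">(?s)&lt;nav\b[^&gt;]*&gt;.*?&lt;/nav&gt;</str>
    <!-- Replace the block with a single space so neighbouring words do not run together. -->
    <str name="replacement"> </str>
    <!-- Treat the replacement as plain text rather than a regex replacement template. -->
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Because the URP rewrites the field value before it reaches the index, the stored value loses the menu text as well, which is the main difference from the char filter variant.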

Upvotes: 1

MatsLindh

Reputation: 52852

Use the XPathEntityProcessor to extract a subset of the document, matched by the provided XPath pattern.

That way you can index the actual content you want in the page (as long as it's valid XML), and ignore other common stuff such as headers/footers/etc. as well.
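
For illustration, a DataImportHandler config along these lines could pull out only the `<main>` element. This is a sketch under several assumptions: the file path, the entity name, and the column names `title` and `content` are made up, the page is assumed to be well-formed XHTML with `<nav>` and `<main>` sitting directly under `<html><body>`, and DataImportHandler has to be available in your Solr version (it is no longer bundled with recent releases).

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Hypothetical entity and path; one record is produced per <html> root element. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            url="/path/to/page.xhtml"
            forEach="/html">
      <!-- Only nodes under <main> are mapped, so the <nav> text is never read into a field. -->
      <field column="title" xpath="/html/body/main/h1"/>
      <!-- flatten="true" collects the text of all child nodes under <main> into one value. -->
      <field column="content" xpath="/html/body/main" flatten="true"/>
    </entity>
  </document>
</dataConfig>
```

Keep in mind that XPathEntityProcessor only supports a limited subset of XPath, so more elaborate selectors (for example, matching on a class attribute buried deep in the tree) may need some experimenting.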

Upvotes: 0

Abhijit Bashetti

Reputation: 8668

Use HTMLStripCharFilterFactory, which will strip HTML tags:

<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

Let me know if it works for you.

More information is available here:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Upvotes: 0
