user3286012

Reputation: 141

Exact Match without special characters in Solr

My field type in the schema is currently defined to do exact matching only:

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Now I want to keep exact matching, but with special characters removed during indexing.

I read that StandardTokenizerFactory would remove the special characters. However, I don't want its side effect of splitting the phrase on whitespace.

Is it possible to use StandardTokenizerFactory at index time and KeywordTokenizerFactory at query time?

Any other ideas?

Upvotes: 1

Views: 3467

Answers (1)

Mysterion

Reputation: 9320

You could use one of Solr's CharFilterFactories. Two that might suit you:

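solr.HTMLStripCharFilterFactory: it strips HTML markup (tags, comments and the like) and replaces character entities such as &amp; with the characters they stand for. It only helps if your special characters come from HTML; it does not remove ordinary punctuation.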

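solr.PatternReplaceCharFilterFactory: it replaces every match of a regular expression with the replacement string. You could use it like this:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z])" replacement=""/>

This removes every non-alphabetic character, including digits and spaces. Char filters run before the tokenizer and before LowerCaseFilterFactory, so the pattern has to list uppercase letters too; a pattern of [^a-z] would delete them as well. Adjust the character class in the same way to remove just your special characters.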
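To show where the char filter sits relative to the rest of the analysis chain, here is a minimal sketch of the question's field type with PatternReplaceCharFilterFactory added. The name text_exact_nospecial is hypothetical, and the character class (keep letters, digits and spaces) is only an assumption about what counts as a "special character" in your data. The second char filter is also an assumption: it collapses the runs of spaces that can be left behind when a symbol between two spaces is removed.

```xml
<!-- Sketch only: the field type name and both patterns are assumptions. -->
<fieldType name="text_exact_nospecial" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters see the raw input, before the tokenizer and before lowercasing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^a-zA-Z0-9 ]" replacement=""/>
    <!-- Collapse runs of whitespace left over after the removal above. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+" replacement=" "/>
    <!-- KeywordTokenizer keeps the whole value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same chain at query time, so both sides are normalized identically. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^a-zA-Z0-9 ]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, KeywordTokenizerFactory still treats the whole value as one token, so the match stays exact and nothing is split on whitespace. Note that removing a character joins its neighbours (for example, a hyphenated word becomes one word); if you would rather keep them apart, use a single space as the replacement instead of an empty string.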

For more information, see https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories

Upvotes: 1
