Jimm

Reputation: 8505

How to correctly configure Solr stemming

I have configured a field in Solr as follows. When I search for the word "Conditioner", I was hoping to also find documents that contain "Conditioning". However, according to the Solr Analysis screen, PorterStemFilterFactory reduces "Conditioning" to "condit" at index time. At search time, my query for "Conditioner" is stemmed to "condition", so it does not match "Conditioning".

How can I configure stemming so that both "Conditioner" and "Conditioning" stem to "condition"?

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" generateNumberParts="1" 
            catenateWords="1" catenateNumbers="1" catenateAll="0" 
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Upvotes: 2

Views: 1813

Answers (2)

cheffe

Reputation: 9500

I would also suggest trying a different stemmer. Solr includes four:

  1. solr.PorterStemFilterFactory
  2. solr.SnowballPorterFilterFactory
  3. solr.KStemFilterFactory
  4. solr.HunspellStemFilterFactory (this one needs a dictionary from an external source, such as OpenOffice)

Each of these produces a different result for your problem, as listed below. Given those results, and since it needs no external resource, I would also opt for KStem. If you do not mind including a dictionary, I would go for Hunspell.

  1. porter
    • Conditioner -> condition
    • Conditioning -> condit
  2. snowballporter
    • Conditioner -> condition
    • Conditioning -> condit
  3. kstem
    • Conditioner -> condition
    • Conditioning -> condition
  4. hunspell with en_GB
    • Conditioner -> condition
    • Conditioning -> conditioning; condition
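
As a minimal sketch of the KStem option, assuming the rest of the text_general field type from the question stays unchanged, the tail of each analyzer chain could look like the fragment below. The change would have to be made in both the index and the query analyzer, and existing documents would need to be reindexed before the new stems take effect.

```xml
<!-- Sketch only: tail of BOTH the index and the query analyzer of text_general. -->
<!-- The only change from the question's config is the last line, -->
<!-- where solr.PorterStemFilterFactory is replaced by solr.KStemFilterFactory. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.KStemFilterFactory"/>
```

The Analysis screen mentioned in the question is a quick way to check that both words now end up as the same token.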

Upvotes: 4

Holger

Reputation: 151

If only this particular case is important, you could override the stemmer:

StemmerOverrideFilterFactory

If the Porter stemmer is generally too aggressive, then try another stemmer like KStem.
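
A minimal sketch of the override approach, assuming a dictionary file named stemdict.txt (a hypothetical name) placed next to the schema: the override filter goes before the stemmer, and terms it rewrites are marked as keywords so the Porter stemmer leaves them alone. The dictionary is a plain-text file with one tab-separated word and stem per line.

```xml
<!-- Sketch only: tail of BOTH analyzers of text_general from the question. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- stemdict.txt is an assumed file name. Example line (tab-separated): -->
<!--   conditioning	condition -->
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
```

Only "conditioning" needs an entry here, since the Porter stemmer already reduces "conditioner" to "condition". As with any change to the index analyzer, the affected documents have to be reindexed.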

Upvotes: 1
