jabawaba

Reputation: 279

In Solr, why is 'built' not being stemmed to 'build' but 'building' is?

I'm trying to figure out two things in this posting:

  1. Why is 'built' NOT being stemmed to 'build', even though the field type definition includes a stemmer, while 'building' is being stemmed to 'build'?

  2. How can I use Luke to examine the index and see which words were stemmed, and to what? I wasn't able to see 'building' stemmed to 'build' in Luke. I know Lucene is stemming it because I can retrieve the row with 'building' by searching for 'build'.

This link was pretty helpful but didn't answer my questions.

For reference, here are the relevant portions of schema.xml.

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

and the field definition is

<field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>

The data set consists of multiple documents: one document has 'building' in the features field, one has 'built' in the same field, and one has 'Built-in' in the features field:

file hd.xml:

<field name="features">building NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>

file ipod_video.xml:

<field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>

file sd500.xml:

 <field name="features">built in flash, red-eye reduction</field>

Using Lukeall-3.3.0, this is the result I get from searching for 'features:build'. Notice that I get back 1 document instead of the expected 3. Even within that one document, I don't see the stemming, i.e., I only see the original word, 'building'.

Again in Luke, searching for 'features:built' returns two documents.

Selecting one of them shows the original 'built' but not 'build'.

Upvotes: 3

Views: 1860

Answers (1)

Robert Muir

Reputation: 3195

For exceptional cases like this, you can tune the stemming algorithm with StemmerOverrideFilter.
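
As a rough sketch of how this might look in the text_en field type from the question: the Porter stemmer works by stripping suffixes, so it reduces 'building' to 'build' but has no rule for an irregular form such as 'built'. An override filter placed ahead of the stemmer can map such words explicitly. The dictionary file name stemdict.txt is a hypothetical choice, and this assumes a Solr version that ships solr.StemmerOverrideFilterFactory.

```xml
<!-- stemdict.txt (hypothetical file name) holds one mapping per line:
     the word, a tab character, then the stem to use, for example
     built	build
-->
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Placed before the stemmer: overridden tokens are marked as keywords,
     so PorterStemFilterFactory leaves them unchanged. -->
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
```

The filter would need to go into both the index and query analyzers, and the documents would need to be reindexed, before a search for 'features:build' could match the documents containing 'built'.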

Upvotes: 2
