ihristov

Reputation: 137

Solr wildcard search incorrect result

I get unexpected results when I run wildcard queries. I am using Solr 6.6.0 with the edismax handler in the Solr admin UI. The query firstNames:James returns results as expected, but when I add a wildcard (firstNames:James*) no results are found.

[Screenshots: query without wildcard, query with wildcard]

For the firstNames field I use the default fieldType text_en with its default tokenizers and filters. When I run the same queries for Stephen (firstNames:Stephen and firstNames:Stephen*), I get results both with and without the wildcard. Below is my field definition from schema.xml:

  <field name="firstNames" type="text_en" multiValued="true" indexed="true" stored="true"/>
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

Upvotes: 3

Views: 1740

Answers (4)

EricLavault

Reputation: 16035

The OP probably doesn't need stemming here (on a name field), but for the general case:

One can make wildcard queries and fuzzy search work properly with stemmers, possessive filters, and any other filters that may truncate tokens by adding the KeywordRepeatFilterFactory before these filters in the analysis chain, so that both the original and the stemmed tokens get indexed:

Emits each token twice, one with the KEYWORD attribute and once without.

If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.

This avoids having to define two distinct field types (stemmed vs unstemmed) or to use an ngram filter for the sole purpose of fixing wildcard queries.
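As a rough sketch of that idea applied to the index analyzer from the question: the type name text_en_keep_original is hypothetical, and the trailing RemoveDuplicatesTokenFilterFactory is an addition not mentioned above (it drops the second copy when stemming leaves a token unchanged). The repeat filter is placed directly before the stemmer here; adjust the position to match the filters you need to bypass.

```xml
<!-- Hypothetical field type name; only the index analyzer is shown. -->
<fieldType name="text_en_keep_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Emit every token twice: one copy flagged as KEYWORD, one not. -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- The stemmer skips the KEYWORD copy, so the original token is kept. -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- Assumption: drop the duplicate when stemming did not change the token. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- The query analyzer from the question would go here. -->
</fieldType>
```

With a chain like this, the Analysis screen in the Solr admin UI should show both the original and the stemmed token at the same position for an input such as James, which is worth checking before reindexing.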

Upvotes: 1

Dominique Bejean

Reputation: 11

About stopwords, the answer to the question "do I have to use stopwords" is not "yes" or "no". It is "why not", applied intelligently according to what your data is. For a drug database, "a", "b", "c" ... shouldn't be in the stopwords definition file. For a movie database, where some titles are 100% stop words, the title field must not use stopwords, but maybe the description field should.

Upvotes: 1

MatsLindh

Reputation: 52802

When you're doing a wildcard query the analysis chain is not invoked (well, that's a small lie - it is, but only the components that are MultiTermAware - which usually means that the LowerCaseFilter is the only thing that is still active).

Since you have a stemming filter and the possessive filter attached, the end s on James is removed. Since this only happens at index time (remember, when you're using a wildcard, the analysis chain is generally skipped at query time), the token jame is stored in the index.

When you make the query firstNames:James*, you ask Solr to "find any document that contains tokens that start with james". Since what was stored is the token jame, there are no tokens starting with james.

When you try this with Stephen instead, neither the stemming nor the possessive filter removes the end of the word, so Stephen* looks for any token starting with stephen, and since that token is present (nothing got changed), a match is returned.

The solution depends on your use case; there is no need for a stemming or possessive filter on a name field, since that doesn't really make sense for names (instead you might apply your own logic to match similar-ish names). Another option is to use an NGram filter instead, effectively generating a token for each prefix and infix version of the token (foo, f, fo, oo, o).
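To illustrate the NGram option, here is a minimal sketch. The type name text_name_ngram and the gram sizes are assumptions picked for illustration, not values from the question; the grams are generated at index time only, so the query side just tokenizes and lowercases.

```xml
<!-- Hypothetical field type; minGramSize/maxGramSize are illustrative values. -->
<fieldType name="text_name_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of each token between 1 and 15 characters long. -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No grams at query time: the typed term is matched against indexed grams. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a type like this, a plain query such as firstNames:Jam would match James without any wildcard, at the cost of a noticeably larger index. If only prefix matches are wanted, solr.EdgeNGramFilterFactory takes the same two attributes and generates prefixes only.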

Upvotes: 4

Walter Underwood

Reputation: 1221

  1. Do not remove stop words. That is a space-saving hack from the 1970s. It makes some words unsearchable, so queries like "vitamin a" will never work because "a" is a stop word. Here is a blog post listing movie titles which are 100% stop words.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

  2. Do not use wildcards with stemming. That will show matches on the stem, not the surface word. You want a separate field with just the lowercase filter.

  3. Do not use stemming on personal names. You do not want to stem "Steve Jobs" to "steve job" or "william golding" to "william gold", for example.

  4. Even better, use the ICU Folding Filter instead of just lowercasing.

https://lucene.apache.org/solr/guide/8_7/filter-descriptions.html#icu-folding-filter
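A separate unstemmed field along those lines might look like the sketch below. The names text_name_exact and firstNames_exact are hypothetical, and ICUFoldingFilterFactory is assumed to be available, which requires the ICU jars from Solr's analysis-extras contrib on the classpath. If that is not an option, solr.LowerCaseFilterFactory can take its place.

```xml
<!-- Hypothetical type: no stop words, no stemming, only ICU folding. -->
<fieldType name="text_name_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds case and accents; needs the analysis-extras contrib jars. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical search-only copy of firstNames for wildcard queries. -->
<field name="firstNames_exact" type="text_name_exact" multiValued="true" indexed="true" stored="false"/>
<copyField source="firstNames" dest="firstNames_exact"/>
```

Wildcard queries would then target the copy, for example firstNames_exact:James*, while the original field can keep whatever analysis suits non-wildcard search. A reindex is needed after adding the copyField.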

Upvotes: 3
