Evren Ozkan

Reputation: 103

Solr substring search with whitespace

I want to find "john doe" by searching for "hn do". Queries such as "*hn*" or "john\ d*" work, but as soon as the query contains whitespace, as in "*hn\ do*", nothing is found. Escaping the wildcards does not help either.

My field definition is as follows:

 <fieldType name="string" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <!--<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25" side="back" />-->
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

Upvotes: 3

Views: 325

Answers (1)

Abhijit Bashetti

Reputation: 8678

Try using NGramTokenizerFactory. It generates n-gram tokens of every size in the given range, for example:

<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="10"/>
</analyzer>

It works like this:

In: "john doe"
Out: "jo","joh","john", "john ","john d","john do",
"john doe", "oh", "ohn","ohn ", "ohn d"...

Then remove the KeywordTokenizerFactory from the fieldType definition, since an analyzer can have only one tokenizer.
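
Putting the pieces together, a field type along these lines might work. This is a sketch, not a tested configuration: the name text_substring is hypothetical, the LowerCaseFilterFactory is carried over from the question, and keeping KeywordTokenizerFactory in the query analyzer is my assumption, made so that the query "hn do" stays a single term and is compared against the indexed n-grams.

```xml
<!-- Hypothetical field type name; gram sizes taken from the example above -->
<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Index every substring of 2 to 10 characters, spaces included -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="10"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Assumption: keep the query as one token so it is matched against the indexed grams -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup like this the query should no longer need wildcards, but the whitespace still has to reach the analyzer intact, for example by quoting the term ("hn do") or escaping the space (hn\ do). Queries longer than maxGramSize would not match, and the field has to be reindexed after the change.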

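You can also consider solr.EdgeNGramTokenizerFactory, which takes an additional attribute, side. Keep in mind that edge n-grams are anchored at one end of the text, so on their own they would not match a mid-string query such as "hn do".

side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text or from the end (back).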

With side="back" and a gram range of 2 to 5, it works like this:

In: "babaloo"
Out: "oo", "loo", "aloo", "baloo"

KeywordTokenizerFactory: this tokenizer treats the entire text field as a single token.

Upvotes: 2
