hakunin

Reputation: 4231

How to make Solr search by short words?

I've got an item that says "4k display" and when I search for "4k display" that item does not seem to be prioritized and other items with "display" (without 4k) come up.

If I search for "4k" nothing shows up.

What in the config should I change to remedy this?

Update: This is what the text field type looks like; it was likely set up by the sunspot gem.

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!--<filter class="solr.StandardFilterFactory"/>-->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--<filter class="solr.KStemFilterFactory"/>-->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>
  </analyzer>
</fieldType>

The minGramSize looks like the culprit?

Upvotes: 0

Views: 308

Answers (2)

root

Reputation: 3937

So let's walk through your analysis chain. First comes the Standard Tokenizer, which splits the text on word boundaries such as whitespace and punctuation. So "4k display" is split into two tokens:

4k,display

Next is the LowerCaseFilter, which lowercases the tokens. In this case nothing changes, as the text is already lowercase. So at the end of this step you still have the same two tokens:

4k,display

Now comes the NGramFilterFactory, which breaks each token into substrings (grams). For example, with no size limits, a token "abcd" would produce grams like this:

a,ab,abc,abcd,b,bc,bcd,c,cd,d

But your filter definition also sets the gram size range:

minGramSize="3" maxGramSize="7"

This means that only grams with a minimum length of 3 and a maximum length of 7 are retained. So in the above example you will only see:

abc,abcd,bcd
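
The gram generation described above can be sketched in a few lines of Python. This is only a rough model of the filter's behaviour, not Solr's actual implementation; the function name `ngrams` and its parameters are made up for illustration.

```python
def ngrams(token, min_size=1, max_size=None):
    """Return every substring of `token` whose length is between min_size and max_size."""
    if max_size is None:
        max_size = len(token)
    grams = []
    for start in range(len(token)):
        # Longest gram that still fits inside the token from this start position.
        longest = min(max_size, len(token) - start)
        for size in range(min_size, longest + 1):
            grams.append(token[start:start + size])
    return grams


print(ngrams("abcd"))                          # all grams, sizes 1 to 4
print(ngrams("abcd", min_size=3, max_size=7))  # only grams of size 3 to 7
```

With the size limits applied, a token shorter than `min_size` yields an empty list, which is the situation "4k" ends up in.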

With me so far?

Now let's apply it to your case. After the lowercase filter we had two tokens:

4k,display

Applying the NGram filter without size limits to both would produce the following:

4,4k,k,d,di,dis,disp,displ,displa,display,i,is,isp and so on. You get the idea.

But since minGramSize is 3, "4", "k" and "4k" are all dropped and never reach your index. That is why you cannot search for "4k": it was never in the index.

Your index only has the 3 to 7 character grams of "display", like

dis,disp,displ,displa,display,isp,ispl and so on.
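
To make this concrete, here is a minimal Python sketch of the whole chain applied to "4k display". It is a simplification under stated assumptions: the regular expression only approximates the Standard Tokenizer, and the `analyze` helper is hypothetical, not part of Solr.

```python
import re


def analyze(text, min_gram=3, max_gram=7):
    """Rough model of the chain: tokenize, lowercase, then n-gram each token."""
    index_terms = {}
    # Assumption: \w+ is a crude stand-in for the Standard Tokenizer.
    for token in re.findall(r"\w+", text):
        token = token.lower()
        index_terms[token] = [
            token[start:start + size]
            for start in range(len(token))
            for size in range(min_gram, min(max_gram, len(token) - start) + 1)
        ]
    return index_terms


terms = analyze("4k display")
print(terms["4k"])       # no grams: "4k" is shorter than min_gram
print(terms["display"])  # every 3 to 7 character substring of "display"
```

Because the same chain is normally applied to the query text as well, a search for "4k" produces no terms to match against either.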

To fix this, you first need to decide how you want to search your data.

Do you really need the NGram filter?

For example, if you just want exact word matches, so that a query for "4k display" returns only results that contain "4k", "display", or both, then you need to change your analysis chain.

In that case, comment out the NGram filter from your analysis chain, reindex, and try querying again.

Upvotes: 2

MatsLindh

Reputation: 52792

Your NGramFilter is configured to only keep tokens that have at least three characters:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>

4k is only two characters, so the filter doesn't produce any tokens for that input. If you want it to still keep 4k, even though it isn't long enough, you can try adding preserveOriginal="true" to the filter definition (according to the javadoc for the filter factory; the code, however, seems to look for a parameter named keepShortTerm, so try that if the first fails).

This will require reindexing your content, so that the new tokens are present for your documents.
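
As a rough illustration of what preserving the original token would change, the Python sketch below models the intended effect. It is an assumption about the behaviour based on the description above, not Solr's code, and the function name is hypothetical.

```python
def ngrams_preserving_original(token, min_gram=3, max_gram=7):
    """Model of an n-gram filter that keeps tokens it could not cover with a gram."""
    grams = [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, min(max_gram, len(token) - start) + 1)
    ]
    # Assumed behaviour: keep the token itself when its length is outside the gram range.
    if len(token) < min_gram or len(token) > max_gram:
        grams.append(token)
    return grams


print(ngrams_preserving_original("4k"))       # the short token survives as-is
print(ngrams_preserving_original("display"))  # unchanged: the usual 3 to 7 character grams
```

In this model "4k" ends up in the index as a single term, so a query for "4k" has something to match.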

Upvotes: 1
