Jerry

Reputation: 2567

Solr Ngram Match Woe

This is my (pretty standard) ngram field type:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Now the query laptop_ngram:"g74sx-a" returns:

<arr name="laptop_ngram">
  <str>ASUS G74SX-A1 17.3-Inch Gaming Laptop</str>
</arr>

but laptop_ngram:"g74sx-a1" finds nothing.

BTW, escaping the "-" does not make any difference.

Any thoughts?

Upvotes: 3

Views: 935

Answers (2)

Jerry

Reputation: 2567

Thanks to O. Klein, who pointed me in a new direction.

I finally settled on WhitespaceTokenizerFactory plus WordDelimiterFilterFactory:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This works for the queries "g74sx", "g74sx-", "g74sx-a", and "g74sx-a1".

However, the journey doesn't end here. I'm still exploring why "G74SX-XA1" is found with "g74sx-x" and "g74sx-xa1", but not with "g74sx-xa".

Upvotes: 1

Okke Klein

Reputation: 2549

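The StandardTokenizerFactory might be altering the term, for example by splitting it at the hyphen. You can check what each analyzer stage produces for both the indexed value and the query on the Solr admin analysis page.

If that is the cause, changing to WhitespaceTokenizerFactory could fix the problem.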
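
If the admin analysis page is awkward to reach, the same check can usually be run over HTTP. A minimal sketch, assuming your solrconfig.xml does not already register a field analysis handler; the path /analysis/field is only an assumed, conventional choice:

```xml
<!-- solrconfig.xml: exposes per-stage analysis output over HTTP.
     The handler path "/analysis/field" is an assumption; any unused path works. -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy"/>
```

A request to that handler with analysis.fieldtype=ngram, analysis.fieldvalue=ASUS G74SX-A1 and analysis.query=g74sx-a1 should list the tokens emitted by each index-time and query-time stage, which makes it visible whether the tokenizer split the term.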

Upvotes: 1
