Reputation: 2567
This is my (pretty standard) ngram schema:
<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Now laptop_ngram:"g74sx-a" returns:
<arr name="laptop_ngram">
<str>ASUS G74SX-A1 17.3-Inch Gaming Laptop</str>
</arr>
but laptop_ngram:"g74sx-a1" finds nothing.
BTW, escaping the "-" makes no difference.
Any thoughts?
Upvotes: 3
Views: 935
Reputation: 2567
Thanks to O. Klein, who pointed me in a new direction.
I finally settled on WhitespaceTokenizerFactory plus WordDelimiterFilterFactory:
<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works for "g74sx", "g74sx-", "g74sx-a", and "g74sx-a1".
However, the journey doesn't end here. I'm still exploring why
"G74SX-XA1" is found with "g74sx-x" and "g74sx-xa1", but not with "g74sx-xa"...
Upvotes: 1
Reputation: 2549
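The StandardTokenizerFactory may be altering the term, for example by splitting it at the "-". You can check exactly which tokens each analyzer stage produces on the Analysis page of the Solr admin UI.
If that is the cause, switching to WhitespaceTokenizerFactory could fix the problem.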
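To make the tokenizer's effect concrete, here is a rough Python sketch of which n-grams end up indexed under two tokenization strategies. It is a simplified model, not Solr's code: split_on_non_alnum is a hypothetical stand-in for a tokenizer that breaks at "-", split_on_whitespace stands in for WhitespaceTokenizerFactory, and token positions and phrase matching are ignored entirely. It only shows whether a query kept as a single term (as a whitespace-tokenized query would be) can equal some indexed gram, so it does not reproduce the original "g74sx-a" vs "g74sx-a1" difference.

```python
import re


def ngrams(token, min_size=1, max_size=15):
    """Return all character n-grams of token with length in [min_size, max_size]."""
    return {
        token[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(token) - n + 1)
    }


def indexed_grams(text, tokenize):
    """Lowercase the text, tokenize it, and collect the n-grams of every token."""
    grams = set()
    for token in tokenize(text.lower()):
        grams |= ngrams(token)
    return grams


def split_on_non_alnum(text):
    # Hypothetical stand-in for a tokenizer that breaks at "-" and other punctuation.
    return re.findall(r"[a-z0-9]+", text)


def split_on_whitespace(text):
    # Stand-in for a whitespace tokenizer: "g74sx-a1" stays a single token.
    return text.split()


title = "ASUS G74SX-A1 17.3-Inch Gaming Laptop"
split_grams = indexed_grams(title, split_on_non_alnum)
whitespace_grams = indexed_grams(title, split_on_whitespace)

# Print, for each query term, whether it matches an indexed gram in each model.
for query in ("g74sx", "g74sx-", "g74sx-a", "g74sx-a1"):
    print(query, query in split_grams, query in whitespace_grams)
```

In this model, any gram containing "-" can only exist when the hyphenated word survives tokenization as one token, which is consistent with the whitespace-based schema matching all four queries. The Analysis page remains the authoritative check for what Solr actually emits.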
Upvotes: 1