Reputation: 1456
I have indexed the value 726719-B21 in a text-type field that uses the analyzer below.
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
Now when I search for this value, the hyphen acts as a separator, so the search returns documents that contain 726719 as well as documents that contain B21. I only want results that contain 726719-B21.
How can I use or configure WordDelimiterFilterFactory so that a search for 726719-B21 matches only that value?
Any suggestions are welcome.
Upvotes: 3
Views: 2450
Reputation: 1737
You can always search with a proximity (sloppy phrase) query.
It's a headache, but you won't need to reindex your data.
"726719 B21"~1
It's not perfect (it would also match 726719 and B21 with one other token between them, and with a larger slop it would match B21-726719 too), but it might be good enough.
Upvotes: -1
Reputation: 52802
The StandardTokenizerFactory will explicitly split any token on -:
"Note that words are split at hyphens."
The ClassicTokenizerFactory is the older version of the same tokenizer, but it has a special rule:
"Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved."
Whether this is suitable depends on your input. If you can have 726719-BAT, then it won't fit.
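As a rough sketch of what that could look like in the schema (the field type name text_classic is made up for illustration, and the filter chain is trimmed down to just lowercasing):

```xml
<!-- Hypothetical field type: name and filter chain are illustrative only -->
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Per the rule quoted above, hyphenated words containing a number are not split -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index-time analyzer means you have to reindex, and it's worth checking the resulting tokens on the Analysis screen of the Solr admin UI before relying on it.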
Another option is to just use the WhitespaceTokenizerFactory, which will only split on actual whitespace (where Java's Character.isWhitespace() evaluates to true).
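A minimal sketch of that variant, assuming a made-up field type name (text_ws) and keeping only the HTML stripping and lowercasing from the original chain:

```xml
<!-- Hypothetical field type: name is illustrative only -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Splits on whitespace only, so 726719-B21 stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keep in mind that this tokenizer leaves punctuation attached to the token, so "726719-B21," (with a trailing comma) in running text would be indexed with the comma included.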
But if you're only indexing 726719-B21 into the field and only want to match it completely, you can use a StrField instead (usually defined as string in your schema), or, if you want it to be case insensitive, use a KeywordTokenizerFactory together with a LowerCaseFilterFactory.
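Both variants could look roughly like this; the names string_ci, part_number and part_number_ci are assumptions for the example, not something from your schema:

```xml
<!-- Exact, case-sensitive match: plain StrField (usually already defined as "string") -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="part_number" type="string" indexed="true" stored="true"/>

<!-- Exact, case-insensitive match: the whole value becomes one lowercased token -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="part_number_ci" type="string_ci" indexed="true" stored="true"/>
```

With either of these the field only matches when the query is the complete value, so they fit identifiers like part numbers rather than free text.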
The other filters you have defined in your sequence might also change your content in fundamental ways (such as stemming, where the end of the tokens will be removed if they match any of a pre-defined set of patterns).
Upvotes: 3