Reputation: 1825
I am trying to force Solr to tokenize documents on whitespace, comma, : and ; only, similar to what SQL Server Full-Text Search does. If I use the text_general field type, it also tokenizes on other characters such as '/', '\' and '-'. I tried using
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
But it doesn't split the text into tokens. Here is what my fieldType looks like:
<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,:;\s*"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Is there anything I am missing? The search also needs to be case-insensitive.
Upvotes: 2
Views: 3631
Reputation: 16035
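Your pattern is wrong: \s*,:;\s* only matches the literal three-character sequence ,:; (optionally surrounded by whitespace), not any one of those characters on its own. Use a character class instead: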
pattern="[\s,;:]"
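For context, here is a minimal sketch of how the field type from the question might look with a character-class pattern. The trailing + is an assumption added here (not part of the suggestion above) so that a run of delimiters such as ", " is treated as a single split point; the field type name, filters and file names are copied from the question.

```xml
<!-- Sketch only: [\s,;:] is a character class, so any ONE of whitespace, comma,
     semicolon or colon acts as a delimiter. The trailing + is an added assumption
     that collapses consecutive delimiters into one split point. -->
<fieldType name="text_sqlserver" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Lowercasing at both index and query time gives case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the schema, the documents need to be reindexed for the new tokenization to take effect. The Analysis screen in the Solr admin UI is a convenient way to check how a sample value is split.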
An alternative you might want to try is to combine:
PatternReplaceCharFilterFactory (to replace , : ; with whitespace)
WhitespaceTokenizerFactory, which tokenizes on whitespace and offers interesting options.
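The alternative above could be sketched roughly as follows. The field type name text_sqlserver_ws is hypothetical, and a single analyzer without a type attribute is used here for brevity, which applies the same chain at index and query time. The stop-word and synonym filters from the question could be added back in the same way.

```xml
<!-- Sketch only: the name text_sqlserver_ws is made up for illustration. -->
<fieldType name="text_sqlserver_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer: turn , ; : into spaces -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,;:]" replacement=" "/>
    <!-- Then split on whitespace only, so '/', '\' and '-' stay inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase for case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```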
Upvotes: 6