Reputation: 4364
I have the following field within my SOLR configure:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options and ideally someone who searches spider-man to get all 3 options. Apart from amending the content when it is indexed is there another way to effectively ignore special characters but not necessarily split on them?
Upvotes: 0
Views: 774
Reputation: 429
I know this is an old post, but the correct answer here is you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory analyzer. What you've described here is synonyms.
Upvotes: -1
Reputation: 9320
One of the possible solutions, especially if the number of delimeter character is small is to replace them via solr.PatternReplaceFilterFactory
like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If keyword tokenizer is a bad option, since it will preserve one token (which could be okay for a field like title), you could either create your own tokenizer, which will split title only on needed symbols or add additional filters like ngram to allow partial match on the title field.
Upvotes: -1