Reputation: 11
I'm new to Solr but have been researching this for about a week and can't figure it out. Any guidance is much appreciated.
My use case is simple: I want to remove all lowercase tokens from a field. I only want to index capitalized words.
I have tried using a tokenizer to do this (in my schema.xml):
<fieldType name="text_upper" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\[A-Z\]\[A-Za-z\]" group="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>
But this resulted in no tokens.
I really would like to just use the "solr.StandardTokenizerFactory" tokenizer, then apply a filter to remove lowercase tokens, but I've looked through all the filters and can't find one that will accomplish this.
Do I need to write my own filter for this or does anyone have any ideas for me? Thank you!
Upvotes: 1
Views: 459
Reputation: 544
You probably need to use PatternCaptureGroupFilterFactory, not PatternTokenizerFactory.
See the Solr documentation: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory
By default (group="-1"), PatternTokenizerFactory uses the pattern to split the input string, so the pattern matches separators, not the actual tokens.
If you need a filter that matches and emits tokens, I think you should be using PatternCaptureGroupFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternCaptureGroupFilterFactory
So I would rewrite your field type as follows:
<fieldType name="text_upper" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([A-Z][A-Za-z]*)" preserve_original="false"/>
  </analyzer>
</fieldType>
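If you would rather keep StandardTokenizerFactory, as the question asks, one possible alternative is to blank out every token that does not start with an uppercase letter and then drop the empty tokens that remain. This is only a sketch and has not been tested against your data. The field type name text_upper_std is made up for illustration, and the regex and the max length of 255 are assumptions you would want to adjust:

```xml
<fieldType name="text_upper_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenize on word boundaries; case is left untouched -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: "capitalized" means the token starts with A-Z.
         Any token that does not is replaced with an empty string. -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[^A-Z].*$" replacement="" replace="all"/>
    <!-- Drop the empty tokens left behind by the previous filter -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  </analyzer>
</fieldType>
```

Note that this also discards tokens that start with a digit, and that it only works if no lowercasing filter runs before the PatternReplaceFilterFactory step.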
Upvotes: 2