Reputation: 11

How to create a Solr filter that removes lowercase tokens

I'm new to Solr, but I have been researching this for about a week and can't figure it out. Any guidance is much appreciated.

My use case is simple: I want to remove all lowercase tokens from a field. I only want to index capitalized words.

I have tried using a tokenizer to do this (in my schema.xml):

<fieldType name="text_upper" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\[A-Z\]\[A-Za-z\]" group="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

But this resulted in no tokens.

I really would like to just use the "solr.StandardTokenizerFactory" tokenizer, then apply a filter to remove lowercase tokens, but I've looked through all the filters and can't find one that will accomplish this.

Do I need to write my own filter for this or does anyone have any ideas for me? Thank you!

Upvotes: 1

Answers (1)

Emad

Reputation: 544

You probably need to use PatternCaptureGroupFilterFactory, not PatternTokenizerFactory.

See the Solr documentation: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory

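By default, PatternTokenizerFactory uses the pattern to split the input string, so the pattern matches separators, not the actual tokens. (A non-negative group value changes this: the tokenizer then emits the matching group as the token.)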
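
To illustrate the split-versus-match distinction, here is a minimal sketch of the two modes that the `group` attribute selects, as I understand the Solr reference documentation. The patterns are only illustrative and the fragment is untested:

```xml
<!-- group="-1" (the default): the pattern describes separators, and the
     input is split wherever it matches. Here: split on whitespace. -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s+" group="-1"/>

<!-- group="0": every whole match of the pattern is emitted as a token.
     Here: ASCII words that start with an uppercase letter. -->
<tokenizer class="solr.PatternTokenizerFactory" pattern="\b[A-Z][A-Za-z]*" group="0"/>
```

An analyzer takes a single tokenizer, so these two lines are alternatives, not a chain.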

If you need a filter that matches and emits tokens, I think you should be using PatternCaptureGroupFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternCaptureGroupFilterFactory

So I would rewrite your field type as follows:

<fieldType name="text_upper" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([A-Z][A-Za-z]*)" preserve_original="false"/>
    </analyzer>
</fieldType>
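
For the StandardTokenizerFactory route mentioned in the question, one possible sketch (not part of the answer above, and untested) is to blank out every token that does not start with an uppercase letter and then drop the resulting empty tokens. The field type name `text_upper_std` and the `max` value are assumptions chosen for illustration, and the character class covers ASCII letters only:

```xml
<fieldType name="text_upper_std" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <!-- Split the text into word tokens first -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Replace any token that does not start with A-Z by an empty string -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="^[^A-Z].*$" replacement="" replace="all"/>
        <!-- Remove the empty tokens left behind by the previous filter -->
        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    </analyzer>
</fieldType>
```

Checking the output on the Analysis screen of the Solr admin UI is a quick way to confirm that only capitalized tokens survive for your data.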

Upvotes: 2
