Alex

Reputation: 33

How to exclude numbers from a Solr text field?

I'm trying to extract data from a set of documents: I use faceting to get every word in them along with its number of occurrences. The problem is that many of the returned terms are numbers, which I don't want. The field is one huge string supplied by my database, where it was originally stored as a binary file.

I would like to filter out those numbers in my request, if possible.

<!-- text_fr with hunspell -->
  <fieldType name="text_fr_token" class="solr.TextField" positionIncrementGap="100">
   <!-- index analyser -->
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- removes l', etc -->
      <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="contractions.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- check whether this should be removed -->
      <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="snowball" />
      <filter class="solr.HunspellStemFilterFactory"
        dictionary="fr_FR.dic"
        affix="fr_FR.aff"
        ignoreCase="true"
        strictAffixParsing="true"/>
    </analyzer>
 <!--Query analyser-->
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- removes l', etc -->
      <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="contractions.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- check whether this should be removed -->
      <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="snowball" />
      <filter class="solr.HunspellStemFilterFactory"
        dictionary="fr_FR.dic"
        affix="fr_FR.aff"
        ignoreCase="true"
        strictAffixParsing="true"/>
    </analyzer>
  </fieldType>

Upvotes: 0

Views: 565

Answers (1)

Peaeater

Reputation: 636

It's not clear to me whether you want to remove numbers from tokens, or remove tokens that are numbers.

To remove numbers from tokens, you could try adding a PatternReplaceFilterFactory to the index analyser section that uses a regular expression to remove digits.

<filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)" replacement="" replace="all" />
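One caveat, as far as I understand this filter: it rewrites the text of each token but never removes a token, so a token made up only of digits is left behind as an empty token. Here is a minimal sketch of the two filters together, assuming you also add a LengthFilterFactory (my addition, not part of the line above) to drop those empty tokens. The max value is arbitrary.

```xml
<!-- Sketch: strip digits from every token, e.g. "abc123" becomes "abc" -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)" replacement="" replace="all"/>
<!-- Assumption: drop the empty tokens left behind by all-digit tokens; max="255" is an arbitrary upper bound -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

These would go inside the index analyzer of text_fr_token. Since this changes how the field is indexed, the documents need to be reindexed before the facet results change.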

To remove tokens that are numbers, you might be able to use one of the Regular Expression Tokenizers as described in the documentation here: https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-RegularExpressionPatternTokenizer.
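If you would rather keep StandardTokenizerFactory, a different route (not the tokenizer approach from the link above) is to blank out only the tokens that consist entirely of digits and then drop the resulting empty tokens. This is a sketch under the assumption that you place it after WordDelimiterGraphFilterFactory, so the number parts that filter generates are caught too. The max value is arbitrary.

```xml
<!-- Sketch: "2019" becomes an empty token; "abc123" is left untouched because the pattern is anchored -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^\d+$" replacement="" replace="all"/>
<!-- Assumption: remove the now-empty tokens; max="255" is an arbitrary upper bound -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

As with any change to the index analyzer, the field has to be reindexed for the facet counts to reflect it.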

Upvotes: 2
