Reputation: 33
I'm trying to extract data from a set of documents. I'm using faceting to get every word in those documents along with its number of occurrences. The problem is that many of the results are numbers, which I don't want. The field is one huge string supplied by my database, where the content was originally stored as a binary file.
If possible, I would like to filter those numbers out in my request.
<!-- text_fr with hunspell -->
<fieldType name="text_fr_token" class="solr.TextField" positionIncrementGap="100">
  <!-- index analyser -->
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="contractions.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- check whether this should be removed -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="snowball" />
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="fr_FR.dic"
            affix="fr_FR.aff"
            ignoreCase="true"
            strictAffixParsing="true"/>
  </analyzer>
  <!-- query analyser -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="contractions.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- check whether this should be removed -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="snowball" />
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="fr_FR.dic"
            affix="fr_FR.aff"
            ignoreCase="true"
            strictAffixParsing="true"/>
  </analyzer>
</fieldType>
Upvotes: 0
Views: 565
Reputation: 636
It's not clear to me whether you want to remove numbers from tokens, or remove tokens that are numbers.
To remove numbers from tokens, you could try adding a PatternReplaceFilterFactory, which uses a regular expression to strip the digits, to the index analyser section:
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)" replacement="" replace="all" />
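One thing to watch for: PatternReplaceFilterFactory rewrites a token's text but does not drop the token, so a token made only of digits would be left behind as an empty string and could still show up in the facet counts. A minimal sketch of one way to handle that, assuming a LengthFilterFactory is placed directly after the replacement filter (the position after LowerCaseFilterFactory and the max value of 255 are illustrative assumptions, not part of the original configuration):

```xml
<!-- Sketch only: would go inside <analyzer type="index">, e.g. after LowerCaseFilterFactory. -->

<!-- Strip every run of digits from each token ("abc123" becomes "abc", "2019" becomes ""). -->
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)" replacement="" replace="all"/>

<!-- Drop tokens that are now empty; max="255" is an arbitrary upper bound chosen for illustration. -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

Because this changes the index-time analysis, the documents would have to be reindexed before the facet results reflect it.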
To remove tokens that are numbers, you might be able to use one of the Regular Expression Tokenizers as described in the documentation here: https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-RegularExpressionPatternTokenizer.
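As a filter-based alternative to the tokenizer route (a swapped-in technique, not the approach described in the linked documentation), the same two filters could be used with an anchored pattern so that only tokens consisting entirely of digits are affected. This is a sketch under that assumption; the max value of 255 is again an arbitrary illustrative bound:

```xml
<!-- Sketch only: would go inside <analyzer type="index">. -->

<!-- Blank out tokens that consist solely of digits; "abc123" is left untouched. -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^\d+$" replacement="" replace="all"/>

<!-- Remove the resulting empty tokens so they never reach the index or the facets. -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

Where these filters sit relative to WordDelimiterGraphFilterFactory matters: with splitOnNumerics="1" and generateNumberParts="1", that filter emits number parts as separate tokens, so placing the pair after it would also catch those generated parts.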
Upvotes: 2