Reputation: 1
I use WordDelimiterFilterFactory to split words that contain numbers into Solr tokens. For example, the word "PHP5" is split into two tokens, "PHP" and "5". When searching, the request that Solr executes is q="php" and q="5", but this also matches documents that contain only "5". What I want is to find only documents that contain "PHP5" or "PHP 5".
Does anyone have an idea how to get around this?
I hope this is clear.
Thanks.
Upvotes: 0
Views: 7701
Reputation: 8678
This filter splits tokens at word delimiters.
In your case you can set splitOnNumerics="0" so that it won't split on numbers.
splitOnNumerics: (integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
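As a rough sketch of how this setting might look in schema.xml, the fragment below defines a field type with splitOnNumerics="0". The field type name text_nosplit and the choice of tokenizer are assumptions for illustration, not something taken from the question.

```xml
<!-- Sketch only: "text_nosplit" is a hypothetical field type name. -->
<fieldType name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splitOnNumerics="0" keeps "PHP5" together instead of emitting "PHP" and "5" -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that "PHP 5" written with a space is still split into two tokens by the tokenizer, so this setting on its own only covers the "PHP5" form.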
The rules for determining delimiters are described at the link below.
Upvotes: 0
Reputation: 1966
In addition to indexing "php5", you need Solr to index "php 5" as a single token. That way a search for "php 5" will match, but a search for "blah 5", for example, will not.
The only way I was able to get this to work well was to use the auto-phrasing token filter from Lucidworks.
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
  <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
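For context, the two analyzers above would normally sit inside a field type in schema.xml. The fragment below is only a sketch of that wrapper: the field type name text_autophrase is hypothetical, and the field name Keywords is assumed from the df default used by the /searchp request handler in this answer.

```xml
<!-- Sketch only: "text_autophrase" is a hypothetical field type name. -->
<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
  <!-- the index analyzer from this answer goes here -->
  <!-- the query analyzer from this answer goes here -->
</fieldType>

<!-- "Keywords" is assumed to be the field searched by default (df). -->
<field name="Keywords" type="text_autophrase" indexed="true" stored="true"/>
```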
synonyms.txt
php5,php_5
protwords.txt (one entry per line, so the delimiter doesn't break them)
php5
php_5
You also have to change the query parser to use the Lucidworks auto-phrasing parser.
solrconfig.xml
<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">_</str>
  <str name="ignoreCase">false</str>
</queryParser>
<requestHandler name="/searchp" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">Keywords</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>
autophrases.txt
php 5
The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter
This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
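One way to check the setup is to index a few small documents and query the /searchp handler. The update message below is a sketch in Solr's XML update format. It assumes the schema has a uniqueKey field named id, which is not stated in this answer, and the document texts are made up for the test.

```xml
<!-- Sketch only: assumes a uniqueKey field named "id"; the sample texts are invented. -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="Keywords">Upgrading to PHP 5</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="Keywords">PHP5 release notes</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="Keywords">Top 5 editors</field>
  </doc>
</add>
```

If the configuration behaves as described here, a query for "php 5" against /searchp should be expected to match documents 1 and 2 but not document 3.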
Upvotes: 1