radamou

Reputation: 1

SOLR WordDelimiterFilterFactory

I use WordDelimiterFilterFactory to split words that contain numbers into separate Solr tokens. For example, the word "PHP5" is split into two tokens, "PHP" and "5". When searching, Solr therefore runs the request as q="php" and q="5", which also matches documents that contain only "5". What I want is to match only documents that contain "PHP5" or "PHP 5".

Does anyone have an idea how to get around this?

I hope this is clear.

Thanks.

Upvotes: 0

Views: 7701

Answers (2)

Abhijit Bashetti

Reputation: 8678

This filter splits tokens at word delimiters.

In your case you can set splitOnNumerics="0", so it won't split on numbers.

splitOnNumerics:

(integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"

The rules for determining delimiters are described at the link below:

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
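
As a rough sketch of where that option goes, a field type along these lines might work. The name "text_nosplit" and the choice of tokenizer are assumptions for illustration, not taken from the question; only the splitOnNumerics="0" setting is the point here.

    <fieldType name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <!-- Assumed tokenizer; keep whichever one the field already uses -->
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <!-- splitOnNumerics="0": "PHP5" stays one token instead of "PHP" + "5" -->
            <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Documents have to be reindexed after the analyzer changes, because the tokens already in the index were produced by the old chain. Note that text written as "PHP 5" with a space is still two tokens, so matching it would probably need a phrase query.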

Upvotes: 0

Evan de la Cruz

Reputation: 1966

You need to get Solr to index "php 5" as a single token, in addition to indexing "php5". That way, a search for "php 5" will match, but a search for "blah 5", for example, will not.

The only way I was able to get this to work well was to use the auto-phrasing token filter from Lucidworks.

    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />  
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
        />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
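
The two analyzers above are assumed to sit inside a field type used by the default search field. A minimal sketch, where the field type name "text_autophrase" is made up for illustration and "Keywords" is taken from the df setting in this answer's request handler:

    <fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
        <!-- the index-time and query-time <analyzer> elements from above go here -->
    </fieldType>
    <field name="Keywords" type="text_autophrase" indexed="true" stored="true"/>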

synonyms.txt

php5,php_5

protwords.txt (so the delimiter doesn't break it)

php5
php_5

You also have to switch the query parser to the Lucidworks auto-phrasing parser.

solrconfig.xml

    <queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
        <str name="phrases">autophrases.txt</str>
        <str name="replaceWhitespaceWith">_</str>
        <str name="ignoreCase">false</str>
    </queryParser>
    <requestHandler name="/searchp" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <int name="rows">10</int>
            <str name="df">Keywords</str>
            <str name="defType">autophrasingParser</str>
        </lst>
    </requestHandler>

autophrases.txt

php 5

The filter can be found here: https://github.com/LucidWorks/auto-phrase-tokenfilter

This article was also very helpful: http://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

Upvotes: 1
