Ivinsky

Reputation: 21

Solr search numbers

QA site: snknop38we.azurewebsites.net/

Example query: Solr: GETting 'q=(111 AND (published:True) AND ((entity_type_id:19)) AND ((available_start_date_time_utc : [* TO NOW]) OR (*:* -available_start_date_time_utc : [* TO *])) AND ((available_end_date_time_utc : [NOW TO *]) OR (*:* -available_end_date_time_utc : [* TO *]))), start=0, rows=20, qf=name short_description published=true is_out_of_stock=false, hl=true, hl.fl=name,short_description' from '/spell'

Expected results: VM­11110xl Kramer

Current results:

(screenshot of the current results; image not available)

Schema field type used for the name and short_description fields:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="lang/protwords_lt.txt"/>-->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_ru.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="lang/protwords_ru.txt"/>-->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

How should we modify our schema to support searching by numbers, without losing the current search features?

Upvotes: 0

Views: 1404

Answers (2)

MatsLindh

Reputation: 52802

The main issue is that you want to match a substring of a token, so depending on exactly what you want to implement, adding an NGramFilter to the analysis chain can be a solution. You'll have to tune the gram sizes to get the hit ratio you're looking for, since the field will then also match other substrings such as "110", depending on how your data is structured.

If you only want to match the start of each token, you can either use the EdgeNGramFilter or a wildcard query (field:111*). Keep in mind that a wildcard query might disable other parts of the token processing, so you're probably better off with the EdgeNGramFilter in that case.

In both cases you'll only want to add the n-gram filter to the index analyzer, not to the query analyzer.
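
As a rough sketch of the index-time-only n-gram setup described above, a separate field type might look like the following. The type name text_ngram and the gram sizes (3 and 10) are assumptions chosen for illustration, not values from the question, and would need tuning against real data.

```xml
<!-- Hypothetical field type: the name "text_ngram" and the gram sizes are illustrative assumptions -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of 3 to 10 characters of each token,
         so a token such as "11110xl" also produces the gram "111" -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter at query time: the search term is matched as typed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Replacing solr.NGramFilterFactory with solr.EdgeNGramFilterFactory (same two attributes) should restrict matches to the start of each token. One way to keep the existing search behaviour is probably to leave text_general untouched, fill a field of this new type through a copyField, and add that field to qf.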

Upvotes: 1

Ashraful Islam

Reputation: 12830

Use the schema below:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="lang/protwords_lt.txt"/>-->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_ru.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="lang/protwords_ru.txt"/>-->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
 </fieldType>

I have used WordDelimiterFilterFactory, which splits words into subwords according to the following rules:

  • split on intra-word delimiters (all non alpha-numeric characters). "Wi-Fi" -> "Wi", "Fi"
  • split on case transitions (can be turned off – see splitOnCaseChange parameter) "PowerShot" -> "Power", "Shot"
  • split on letter-number transitions (can be turned off – see splitOnNumerics parameter) "SD500" -> "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored "//hello---there, 'dude'" -> "hello", "there", "dude"
  • trailing "'s" are removed for each subword (can be turned off – see stemEnglishPossessive parameter) "O'Neil's" -> "O", "Neil"
    Note: this step isn't performed in a separate filter because of possible subword combinations.

Source : http://www.pathbreak.com/blog/solr-text-field-types-analyzers-tokenizers-filters-explained
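
The two toggles named in the rules above do not appear in the schema in this answer, so it relies on their defaults. A minimal sketch with them written out explicitly could look like this; the values shown are assumptions matching what I understand to be the defaults, not settings taken from the question.

```xml
<!-- Illustrative only: the index-time filter from the schema above,
     with splitOnNumerics and stemEnglishPossessive spelled out.
     splitOnNumerics="1"       splits at letter-number transitions: "SD500" -> "SD", "500"
     stemEnglishPossessive="1" strips a trailing "'s" from each subword -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        stemEnglishPossessive="1"/>
```

On newer Solr releases WordDelimiterFilterFactory is deprecated in favour of WordDelimiterGraphFilterFactory, so the class name may need adjusting for the version in use.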

Upvotes: 0
