sanitar4eg

Reputation: 163

Solr does not escape whitespace when searching

I use Solr 6.4.1.

I need to search a field that can contain special characters such as -, _ and . (dash, underscore, dot). At the same time, I need to be able to find the entity without those characters.

For example, if the value is G2-5SG, I should be able to find it with any of the following queries: G2 5SG, G2-5SG, G25SG.

I have the following configuration for the field type:

    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w+)([-_.\s])" replacement="$1"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="16"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w+)([-_.\s])" replacement="$1"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

Searching with the special characters works fine. But when I search for the value without them, the server returns an empty result set.

In the analyzer, the values are marked as matching, with G2 5SG as the index value and G25SG as the query value.

Upvotes: 0

Views: 607

Answers (2)

Persimmonium

Reputation: 15771

One thing that would work:

  • use a copyField so that two fields are fed the same text but analyzed differently
  • keep the first field with the symbols you need, perhaps only lowercased, using KeywordTokenizerFactory
  • make the second field similar, but strip all such characters so that only alphanumeric characters remain
  • search both fields with the edismax parser. You can also give more weight to the first field, which is closer to the original value than the second, to improve relevancy
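
A minimal sketch of that setup, not a tested configuration. The field and type names (`code`, `code_stripped`, `text_code_exact`, `text_code_stripped`), the `[^A-Za-z0-9]` pattern, and the `^2` boost are all illustrative assumptions rather than anything from the question:

```xml
<!-- schema: two field types that differ only in the charFilter (all names are hypothetical) -->
<fieldType name="text_code_exact" class="solr.TextField">
  <analyzer>
    <!-- keep the value as a single token, symbols included -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_code_stripped" class="solr.TextField">
  <analyzer>
    <!-- drop every non-alphanumeric character before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="code" type="text_code_exact" indexed="true" stored="true"/>
<field name="code_stripped" type="text_code_stripped" indexed="true" stored="false"/>
<copyField source="code" dest="code_stripped"/>

<!-- solrconfig.xml: search both fields with edismax, weighting the exact field higher -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">code^2 code_stripped</str>
  </lst>
</requestHandler>
```

With something like this, G2-5SG should match on `code` and G25SG on `code_stripped`. A query containing a space, such as G2 5SG, probably needs to be sent quoted ("G2 5SG") so that the whole string reaches the analyzer instead of being split on whitespace by the query parser first. If prefix matching is still needed, an EdgeNGramFilterFactory could be added to the index-side analysis as in the question.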

Upvotes: 1

Abhijit Bashetti

Reputation: 8658

You can use

    <tokenizer class="solr.StandardTokenizerFactory"/>

which splits the text field into tokens, treating whitespace and punctuation as delimiters, instead of

    <tokenizer class="solr.KeywordTokenizerFactory"/>

which treats the entire text field as a single token.

You could try something like the following:

    <fieldType name="subword" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"/>
            <filter class="solr.FlattenGraphFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>
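
As a hedged variation on the field type above: WordDelimiterFilterFactory takes attributes such as `catenateAll`, `preserveOriginal` and `splitOnNumerics`, which control whether the joined form (G25SG), the original form (G2-5SG) and the parts (G2, 5SG) are all indexed. The type name `subword_catenated` and every attribute value below are assumptions chosen for the G2-5SG example, not part of the configuration above, and the result should be checked in the Analysis screen:

```xml
<!-- Sketch only: the type name and all attribute values are illustrative assumptions -->
<fieldType name="subword_catenated" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emit the parts, the joined form, and the untouched original;
         splitOnNumerics="0" is meant to keep "G2" and "5SG" whole -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query is kept as one lowercased token and compared against the indexed variants -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The idea is that G25SG should match the catenated token, G2-5SG the preserved original, and the two halves of G2 5SG (which the query parser is likely to split on whitespace before analysis) the individual parts.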

For more details, refer to the Tokenizers page of the Solr documentation.

Upvotes: 0
