Stephan

Reputation: 1037

Solr 5.1: Problems with search queries containing underscores

I've indexed an internal website using Solr 5.1 and the new managed schema. The page title, URL, and body are indexed with the "text_en" and "text_en_splitting" field types. I get pretty much the behavior I want, except when the query string contains underscores.

My use case is the following: Suppose we have 3 terms, "first", "second" and "third", and that "second" does not exist in the index but "first" and "third" do. When the search term is "first second third", I get the behavior I want (i.e. pages with "first" and "third" are returned).

However, when the search term is "first_second_third", I get 0 results, but I would expect to get something since "first" and "third" exist in the index.

I'm using edismax search with qf=url_txt_en title_txt_en title_txt_en_split text_txt_en_split

Can someone suggest a way to tweak my config to get what I want?

Upvotes: 1

Views: 1531

Answers (3)

nir

Reputation: 3868

You can replace _ with any non-alphanumeric character that your tokenizer splits on. In the following example I convert it to a hyphen '-', which StandardTokenizerFactory treats as a delimiter:

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="_"
            replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
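
For context, here is a minimal sketch of where those lines would sit inside a field type. The type name text_en_underscore is hypothetical and the filter chain is trimmed down for illustration; keep whatever filters your existing type already uses. A single analyzer element without a type attribute applies to both indexing and querying, so the replacement happens on both sides.

```xml
<!-- Hypothetical field type name; the filter chain is an illustrative assumption -->
<fieldType name="text_en_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_"
                replacement="-"/>
    <!-- The tokenizer now sees "first-second-third" and splits at the hyphens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes index-time analysis, existing documents need to be reindexed before the new tokens are searchable.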

Upvotes: 0

Abhijit Bashetti

Reputation: 8668

Try the field type below, which uses WordDelimiterFilterFactory. It splits words into subwords and performs optional transformations on subword groups.

By default, words are split into subwords with the following rules:

1. Split on intra-word delimiters (all non-alphanumeric characters): "Wi-Fi" -> "Wi", "Fi"

2. Split on case transitions (can be turned off, see the splitOnCaseChange parameter): "PowerShot" -> "Power", "Shot"

3. Split on letter-number transitions (can be turned off, see the splitOnNumerics parameter): "SD500" -> "SD", "500"

<fieldtype name="subword" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
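
To use this type, a field has to reference it by name. The declaration below is only a sketch: body_subword is a hypothetical field name, and the indexed/stored settings are assumptions you should adapt to your schema.

```xml
<!-- Hypothetical field that uses the "subword" type defined above -->
<field name="body_subword" type="subword" indexed="true" stored="true"/>
```

With generateWordParts="1" and preserveOriginal="1", a token such as "first_second_third" should come out as "first", "second", "third" plus the original token, so the individual words become searchable. Add the new field to your qf list and reindex so the index-side analysis takes effect.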

Upvotes: 0

nofinator

Reputation: 3023

Are you using the definition for text_en_splitting that comes with the Solr examples?

If so, the issue is that this type uses WhitespaceTokenizerFactory, which splits only on whitespace. It does not split on underscores, so "first_second_third" reaches the rest of the analysis chain as a single token.

Instead, it sounds like you need to tokenize on both whitespace and underscores. So try replacing that with PatternTokenizerFactory, like so:

<tokenizer class="solr.PatternTokenizerFactory" pattern="[_\s]+" />

Don't forget to change this in both the index and query analyzer blocks.
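
As a rough sketch of what that looks like in both blocks, assuming a hypothetical type name text_en_pattern and a deliberately short filter chain (keep the filters from your real text_en_splitting definition): the character class [_\s]+ treats any run of underscores or whitespace as a separator, so the query is split whether it is typed with spaces or with underscores.

```xml
<!-- Hypothetical field type name; filters are trimmed for illustration -->
<fieldType name="text_en_pattern" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- By default the pattern is used as the delimiter between tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[_\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer on the query side so both sides produce matching tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[_\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the index analyzer you need to reindex. The Analysis screen in the Solr admin UI is a quick way to confirm that "first_second_third" now yields three tokens.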

Upvotes: 1
