Reputation: 1037
I've indexed an internal website using Solr 5.1 and the new managed schema. I've indexed the page title, url, and body using "text_en" and "text_en_splitting". I get pretty much the behavior I want except when the query string contains underscores.
My use case is the following: Suppose we have 3 terms, "first", "second" and "third", and that "second" does not exist in the index but "first" and "third" do. When the search term is "first second third", I get the behavior I want (i.e. pages with "first" and "third" are returned).
However, when the search term is "first_second_third", I get 0 results, but I would expect to get something since "first" and "third" exist in the index.
I'm using edismax search with qf=url_txt_en title_txt_en title_txt_en_split text_txt_en_split
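For reference, this is roughly equivalent to the following defaults in solrconfig.xml (the /select handler name is just illustrative; in practice I pass defType and qf as request parameters):
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax query parser over the four text fields mentioned above -->
    <str name="defType">edismax</str>
    <str name="qf">url_txt_en title_txt_en title_txt_en_split text_txt_en_split</str>
  </lst>
</requestHandler>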
Can someone suggest a way to tweak my config to get what I want?
Upvotes: 1
Views: 1531
Reputation: 3868
You can simply replace _ with any non-alphanumeric character that your tokenizer splits on. In the following case I replace it with a hyphen '-', which is a valid delimiter for StandardTokenizerFactory:
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="_"
            replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
Upvotes: 0
Reputation: 8668
Try the field type below, which uses WordDelimiterFilterFactory. It splits words into subwords and performs optional transformations on subword groups.
By default, words are split into subwords with the following rules:
1. Split on intra-word delimiters (all non-alphanumeric characters): "Wi-Fi" -> "Wi", "Fi"
2. Split on case transitions (can be turned off; see the splitOnCaseChange parameter): "PowerShot" -> "Power", "Shot"
3. Split on letter-number transitions (can be turned off; see the splitOnNumerics parameter): "SD500" -> "SD", "500"
<fieldtype name="subword" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
preserveOriginal="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
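To actually use it, the type still has to be attached to a field and that field added to your qf. The field names below are illustrative assumptions, not taken from your schema:
<!-- hypothetical field using the "subword" type; adjust names to match your schema -->
<field name="title_subword" type="subword" indexed="true" stored="true"/>
<copyField source="title_txt_en" dest="title_subword"/>
With this chain, WordDelimiterFilter splits "first_second_third" on the underscores at both index and query time (keeping the original token as well, since preserveOriginal="1"), so "first" and "third" can match even though "second" is missing from the index.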
Upvotes: 0
Reputation: 3023
Are you using the definition for text_en_splitting that comes with the Solr examples? If so, the issue is that this type uses WhitespaceTokenizerFactory, which creates tokens by splitting on whitespace only, so it does not split on underscores.
Instead, it sounds like you need to tokenize on both whitespace and underscores. Try replacing that tokenizer with PatternTokenizerFactory, like so:
<tokenizer class="solr.PatternTokenizerFactory" pattern="_\s*" />
Don't forget to change this in both the index and query analyzer blocks.
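As a rough sketch, assuming the stock text_en_splitting definition from the Solr examples (the filter chain is abbreviated here, so keep whatever filters your definition already has), the change would look something like this:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split on runs of whitespace and/or underscores -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s_]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ... keep the remaining filters from your existing definition ... -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s_]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ... keep the remaining filters from your existing definition ... -->
  </analyzer>
</fieldType>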
Upvotes: 1