condit

Reputation: 10962

WordDelimiterFilterFactory not including all permutations

I have a Solr index that has to handle part numbers, which WordDelimiterFilterFactory seems ideally suited for. An example part number is "CH2300-100". I expect the following queries to match this field (and they do):

But the following query doesn't match:

Looking at the debugging output, that combination of word parts isn't generated. I expected the catenateWords and/or catenateNumbers attributes to handle this case, but they don't seem to. Am I missing something in the configuration that would allow all permutations of the tokenized fragments to be matched?

<schema version="1.5" name="test">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field stored="true" name="id" type="text" />
    <field stored="true" indexed="true" name="catnum" type="text" />
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>

Upvotes: 2

Views: 3589

Answers (1)

accounted4

Reputation: 1705

I suspect that 'CH2300' is not an indexed token because splitOnNumerics defaults to "1". In the split phase, the filter separates CH from 2300, and only then applies the generators to those parts individually (as well as to the catenated tokens), so nothing rejoins just CH and 2300.

One option is to add splitOnNumerics="0" to your filter factory. However, that may keep 'CH' on its own from matching. Another option is to add a word delimiter filter to the query analyzer, so that queries are split on numerics as well.
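As a rough, untested sketch of those two options against the schema in the question (the attribute values on the query-side filter are my assumption, not something taken from a working config):

```xml
<!-- Option 1 (index time): do not split on letter/digit transitions, so "CH2300" stays one part.
     Trade-off: "CH" and "2300" are then no longer indexed as separate tokens. -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" />

<!-- Option 2 (query time): leave the index analyzer unchanged and split the query instead,
     so a query for "CH2300" is analyzed into "CH" and "2300", both of which are indexed. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="1" />
</analyzer>
```

With option 2, whether the two query tokens are combined as a phrase or as separate clauses depends on your query parser settings, so check the result on the Analysis page of the admin UI before relying on it.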

Edit

A third and possibly better option is to use a shingle filter and leave splitOnNumerics="1", so that CH, 2300, and CH2300 all get indexed. Adding this line after your word delimiter filter factory should solve the problem:

<filter class="solr.ShingleFilterFactory" tokenSeparator=""/>
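In context, the index analyzer would then look roughly like the sketch below. I have not run this exact config. The minShingleSize, maxShingleSize, and outputUnigrams attributes only spell out the filter's defaults to make the behaviour visible, and can be dropped.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Unchanged from the question; splitOnNumerics is left at its default of "1" -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1" />
  <!-- Joins adjacent tokens with no separator (e.g. CH + 2300), while still emitting the single tokens -->
  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="true" tokenSeparator="" />
</analyzer>
```

Because the word delimiter filter also emits the original and catenated tokens at overlapping positions, the shingle filter may produce more combinations than you need, so the index will grow somewhat. Reindex after the change and verify the generated tokens on the Analysis page.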

Upvotes: 3
