Reputation: 3782
I have a field type with a simple WhitespaceTokenizer followed by a WordDelimiterGraphFilter. This should allow phrase queries with terms like "E-mail" to find things with both "E-mail" and "E mail" in them. However, in some circumstances this does not work.
This can be reproduced with a toy data set of four terms (one, two, three, four) in which each of the three gaps between terms is either a space or a hyphen.
An older version of the question used 6 single-character terms, a-b-c-d-e-f.
There are 8 combinations in total (one for each choice of separator at the three positions), which means a search for any one of these items should find all 8.
However, some phrase queries do not find every item. For example, a search for "one two-three four" finds the 7 other items but not itself (a search for "two-three four", on the other hand, works).
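For reference, the toy data set described above could be indexed with a Solr XML update message along these lines. This is only a sketch: the terms one, two, three and four are inferred from the example query, and the field names id and text are hypothetical (text is assumed to be declared with the field type shown in the schema.xml excerpt).

```xml
<!-- Sketch only: field names "id" and "text" are assumptions, not taken from the question. -->
<add>
  <!-- No hyphens -->
  <doc><field name="id">1</field><field name="text">one two three four</field></doc>
  <!-- One hyphen, in each of the three possible positions -->
  <doc><field name="id">2</field><field name="text">one-two three four</field></doc>
  <doc><field name="id">3</field><field name="text">one two-three four</field></doc>
  <doc><field name="id">4</field><field name="text">one two three-four</field></doc>
  <!-- Two hyphens -->
  <doc><field name="id">5</field><field name="text">one-two-three four</field></doc>
  <doc><field name="id">6</field><field name="text">one-two three-four</field></doc>
  <doc><field name="id">7</field><field name="text">one two-three-four</field></doc>
  <!-- Three hyphens -->
  <doc><field name="id">8</field><field name="text">one-two-three-four</field></doc>
</add>
```

With these eight documents, the failing case from the question is the phrase query "one two-three four" not returning document 3.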
The fieldType in the schema.xml is as follows:
<fieldType name="text_wrong" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="true" multiValued="false" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="1" />
    <filter class="solr.FlattenGraphFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
Increasing the query slop to 2 finds all results but will also find other similar results that aren't exact (undesired behaviour).
It can also be "fixed" by setting preserveOriginal=0. But I'm not sure what other side-effects that might cause on our searches and it does not seem to be the correct behaviour.
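As a sketch of that workaround, the field type might look like the following. The question does not say in which analyzer the flag was changed, so this assumes both; the name text_no_original is hypothetical. With preserveOriginal="0" and all catenate options off, the filter emits only the split parts, so "two-three" should yield two and three at consecutive positions and no token spans more than one position. The likely trade-off is that the unsplit token "two-three" is no longer indexed at all.

```xml
<!-- Sketch only: "text_no_original" is a made-up name; changing both analyzers is an assumption. -->
<fieldType name="text_no_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Same options as text_wrong, except that the original token is not kept -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="0" />
    <filter class="solr.FlattenGraphFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="0" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Any change to the index analyzer only takes effect for documents indexed after the change, so the toy data set would need to be reindexed before comparing results.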
The Analysis looks like this:
As you can see, "four" is now at position 4 although it is the third term in "one two-three four". But this should still match up with the positions in the query, which are identical.
Is this correct or a bug?
Upvotes: 1
Views: 858
Reputation: 52792
You're missing an index-time Flatten Graph Filter. From the Solr Reference Guide entry for the Word Delimiter Graph Filter:
This filter splits tokens at word delimiters.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Word Delimiter Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when tokens are split, you should instead use this filter at query time.
Upvotes: 1