Druckles

Reputation: 3782

Solr cannot find all terms after WordDelimiterGraphFilter

I have a field type with a simple WhitespaceTokenizer followed by a WordDelimiterGraphFilter. This should allow a phrase query containing a term like "E-mail" to match text containing either "E-mail" or "E mail". However, in some circumstances this does not work.

This can be reproduced with a toy data set of terms separated by a varying number of hyphens:

An older version of the question used 6 single-character terms, a-b-c-d-e-f.

There are 8 combinations in total (a hyphen or a space at each of the three possible positions), which means a search for any one of these items should find all 8.
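
For illustration, the eight variants could be indexed with a Solr XML update message like the one below. This is only a sketch: the terms are taken from the example query "one two-three four", and the field names `id` and `text` are hypothetical, with `text` assumed to use the `text_wrong` field type.

```xml
<!-- Sketch of the toy data set. The field names "id" and "text" are assumptions;
     "text" is presumed to be declared with the text_wrong field type. -->
<add>
  <doc><field name="id">1</field><field name="text">one two three four</field></doc>
  <doc><field name="id">2</field><field name="text">one-two three four</field></doc>
  <doc><field name="id">3</field><field name="text">one two-three four</field></doc>
  <doc><field name="id">4</field><field name="text">one two three-four</field></doc>
  <doc><field name="id">5</field><field name="text">one-two-three four</field></doc>
  <doc><field name="id">6</field><field name="text">one-two three-four</field></doc>
  <doc><field name="id">7</field><field name="text">one two-three-four</field></doc>
  <doc><field name="id">8</field><field name="text">one-two-three-four</field></doc>
</add>
```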

However, some phrase queries do not find every item. For example, a search for "one two-three four" finds the other 7 items but not itself ("two-three four", on the other hand, works).

The fieldType in the schema.xml is as follows:

<fieldType name="text_wrong" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="true" multiValued="false" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
                splitOnNumerics="0" preserveOriginal="1" />
        <filter class="solr.FlattenGraphFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
                splitOnNumerics="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

Increasing the query slop to 2 finds all results, but it also matches similar results that are not exact, which is undesired.

It can also be "fixed" by setting preserveOriginal=0, but I'm not sure what side effects that might have on our searches, and it does not seem to be the correct behaviour.
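
For reference, that workaround amounts to something like the following filter definition. This is a sketch that assumes the attribute is changed in both the index and the query analyzer, with every other attribute left as in the fieldType above.

```xml
<!-- Same filter as in text_wrong, but with preserveOriginal switched off (assumed
     for both analyzers). "two-three" is then emitted only as "two" and "three". -->
<filter class="solr.WordDelimiterGraphFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
        splitOnNumerics="0" preserveOriginal="0" />
```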

The Analysis looks like this:

[Image: Solr analysis of "one two-three four"]

As you can see, "four" ends up at position 4, although it is the third whitespace-separated term in "one two-three four". But this should still match up with the positions in the query, which are identical.

Is this correct or a bug?

Upvotes: 1

Views: 858

Answers (1)

MatsLindh

Reputation: 52792

You're missing an index-time Flatten Graph Filter.

WordDelimiterGraphFilter

This filter splits tokens at word delimiters.

If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Word Delimiter Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when tokens are split, you should instead use this filter at query time.

Upvotes: 1
