Tim Bezhashvyly

Reputation: 9090

Solr synonyms containing whitespace

I have the following field:

<fieldType name="brand" class="solr.TextField">
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonym-brand.txt" ignoreCase="false" expand="false"/>
    </analyzer>
</fieldType>

...

<field name="brand" type="brand" indexed="true" stored="false"/>

And the synonyms file contains something like this:

foo => Adidas
bar => adidas originals

Searching for brand:foo returns the same results as brand:Adidas, while searching for brand:bar returns nothing.

Is something wrong with my config, or is multi-term synonym mapping just that hard in Solr?

Upvotes: 1

Views: 1689

Answers (2)

Tim Bezhashvyly

Reputation: 9090

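I ended up replacing spaces with underscores (any other character that is guaranteed not to appear in field values would work as well), with multi-word entries in the synonyms file written the same way, e.g. bar => adidas_originals: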

<fieldType name="brand" class="solr.TextField">
    <analyzer type="index">
        <!-- charFilters run on the raw input, before the tokenizer -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\s)" replacement="_"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\s)" replacement="_"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonym-brand.txt" ignoreCase="false" expand="false"/>
    </analyzer>
</fieldType>

Upvotes: 1

Zeke Farwell

Reputation: 261

Multi-term synonyms are definitely difficult to deal with in Solr; in my opinion, this is one of its biggest shortcomings. From the Solr documentation:

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (i.e. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

  1. The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
  2. Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.

The way I've dealt with this issue is to handle any multi-word synonyms at index time, as recommended by the Solr docs and the article you linked to. I made a query-time synonym file to handle all single-word synonym sets, and a separate index-time synonym file for sets with multi-word variants. With your example, the XML would look something like this:

<fieldType name="brand" class="solr.TextField">
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms-query.txt" ignoreCase="false" expand="false"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms-index.txt" ignoreCase="false" expand="true"/>
    </analyzer>
</fieldType>

synonyms-query.txt contents:

foo => Adidas

synonyms-index.txt contents:

adidas originals => bar

A search for brand:bar should now return results containing "adidas originals", but now a search for brand:adidas won't return results. This is because the whole phrase "adidas originals" has been replaced by "bar" in the index. Since this is probably not what you want, you might change the synonyms-index.txt file to use equivalent synonyms instead of explicit mapping:

adidas originals, bar

With this syntax any instance of "adidas originals" or "bar" will get expanded to include both in the index. If none of your brand names actually include "bar" then this shouldn't be a problem, but if they do then you can use the workaround mentioned in this answer:

synonyms-query.txt contents:

foo => Adidas
bar => adidasoriginals
adidas originals => adidasoriginals

synonyms-index.txt contents:

adidas originals => adidasoriginals

This setup removes the whitespace from "adidas originals" at both index time and query time. Because the phrase is now represented by a single token in the index, you can use explicit mapping at query time without running into whitespace problems.
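
For reference, here is a rough sketch of how the field type might look for this last setup. The tokenizerFactory attribute is an assumption on my part and not part of the configuration shown earlier: by default SynonymFilterFactory parses the synonyms file on whitespace, so a rule for "adidas originals" is read as two tokens, while KeywordTokenizerFactory hands the filter the whole field value as one token. Setting tokenizerFactory asks the filter to parse the file with the same tokenizer as the field. Check that your Solr version supports it before relying on this.

```xml
<fieldType name="brand" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- synonyms-index.txt contains: adidas originals => adidasoriginals -->
        <!-- tokenizerFactory (assumption, see above) keeps each synonym entry as a single token -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms-index.txt"
                tokenizerFactory="solr.KeywordTokenizerFactory"
                ignoreCase="false" expand="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- synonyms-query.txt contains the three query-time rules listed above -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms-query.txt"
                tokenizerFactory="solr.KeywordTokenizerFactory"
                ignoreCase="false" expand="false"/>
    </analyzer>
</fieldType>
```

With something like this in place, brand:bar and brand:"adidas originals" (quoted, so the query parser does not split it on whitespace) should both be rewritten to adidasoriginals at query time and match the single token stored in the index.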

Configuring Solr synonyms definitely tried my patience. There's a lot of power there, but it is rather confusing. Good luck!

Upvotes: 2
