Reputation: 133
This requires a bit of explanation, but I think this discussion could help anyone with important phrases in their Solr index.
I'm using Solr to power search in an e-commerce context, and I'm trying to improve spell checking suggestions for brand names. By default, Solr spell checks each word individually, without regard for whether the resulting phrase makes sense. For instance, a search for "paula dean" brings back "Did you mean: paula bean?", while the brand name is actually "Paula Deen." Currently, my spelling dictionary is a whitespace-tokenized field called spellField. To index full brand names for spell checking, my Solr import replaces whitespace in brand names from my database with underscores, e.g. Entree Casual Dining -> Entree_Casual_Dining. Here is the schema for the fieldType of spellField:
<fieldType name="spellcheckquery" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\s[0-9]+\s)|(^[0-9]+\s)|(\s[0-9]+$)|(^[0-9]+$)" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\s[0-9]+\s)|(^[0-9]+\s)|(\s[0-9]+$)|(^[0-9]+$)" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="true"/>
</analyzer>
</fieldType>
Putting underscores in the brand name was the best way I could think of to keep multi-word brand names as single tokens in a whitespace-tokenized field. I can easily strip the underscores from the returned spelling suggestion after it comes back. So now, when a query comes in, Solr shingles the query and looks for a spelling suggestion for each shingle. For example, a search for the brand with a spelling error: "entre casual dining" -> "entre", "casual", "dining", "entre casual", "casual dining", "entre casual dining". The shingle "entre casual dining" is close to the indexed token ("Entree_Casual_Dining"), so "entree_casual_dining" comes back as the suggestion. Great.
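To make the shingling step concrete, here is a rough Python sketch of what the query analyzer above produces for a whitespace-tokenized, lowercased query. The function name `shingles` and its parameters are my own for illustration and only mirror the `minShingleSize`, `maxShingleSize` and `outputUnigrams` attributes; the numeric pattern filter is left out, and the output is grouped by shingle size to match the listing above rather than the order in which the filter actually emits tokens.

```python
def shingles(query, min_size=2, max_size=4, output_unigrams=True):
    """Approximate the query-side token stream: lowercase, split on
    whitespace, then build word n-grams joined by a single space."""
    tokens = query.lower().split()
    # Unigrams (if requested) followed by every shingle size in range.
    sizes = ([1] if output_unigrams else []) + list(range(min_size, max_size + 1))
    out = []
    for n in sizes:
        # Slide a window of n tokens across the query.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


if __name__ == "__main__":
    for s in shingles("entre casual dining"):
        print(s)
```

Each of these strings is what gets looked up against the spelling dictionary, which is why only the three-word shingle lands near the underscored brand token.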
Suppose the query includes a brand name and a type of product, like "entre casual dining table set." We would want to find the spelling correction for the brand name and replace the entire misspelled brand to return the suggestion "entree_casual_dining table set." I figured Solr's collate functionality would handle this well. When I enter this search, though, Solr finds the correct brand suggestion, but it does not collate it back into the result:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">48</int>
</lst>
<result name="response" numFound="100" start="0"/>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="entre casual dining">
<int name="numFound">1</int>
<int name="startOffset">0</int>
<int name="endOffset">19</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">entree_casual_dining</str>
<int name="freq">21</int>
</lst>
</arr>
</lst>
<bool name="correctlySpelled">false</bool>
<str name="collation">entre casual dining table set</str>
</lst>
</lst>
</response>
It has no problem collating when the correction comes from a single misspelled word in the query. For example, if you misspell "table", it will collate the correction back into the query properly.
What might be going wrong when it tries to collate a suggestion from a multi-word shingle?
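For reference, the collation I expected can be sketched on the client side from the offsets in the response. This is only a workaround sketch in Python, not an explanation of Solr's collator: the `collate` function and the dict shape are hypothetical, with keys named after the `startOffset`, `endOffset` and `word` elements in the response, and it assumes the suggested spans do not overlap.

```python
def collate(query, suggestions):
    """Splice each suggested word into the query using its character
    offsets. Spans are applied right to left so that earlier offsets
    stay valid after a replacement changes the string length."""
    result = query
    for s in sorted(suggestions, key=lambda s: s["startOffset"], reverse=True):
        result = result[:s["startOffset"]] + s["word"] + result[s["endOffset"]:]
    return result


if __name__ == "__main__":
    query = "entre casual dining table set"
    # Values copied from the spellcheck response in the question.
    found = [{"startOffset": 0, "endOffset": 19, "word": "entree_casual_dining"}]
    collated = collate(query, found)
    print(collated)
    # Strip the underscores that the import added to brand names.
    print(collated.replace("_", " "))
```

With the response above, this splices the brand suggestion over characters 0 to 19 of the query and leaves " table set" untouched.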
Upvotes: 2
Views: 3923
Reputation: 11
I would consider changing your analyzer to look more like this:
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Setting preserveOriginal to 1 in the index analyzer tokenizes the brand name both as individual words and as one big token. Also, I believe the Shingle stuff is deprecated and slated for removal in 4.0.
Upvotes: 1
Reputation: 193
I have seen Solr successfully collate suggestions for multi-word keywords without any issue, though that was on Solr 3.1. The one issue I had was when several words in a multi-word keyword were misspelled and Solr had suggestions for each: with "collate" turned on, the spell checker presents multiple combinations, and that's when it gets trickier.
Even with "_" in your scenario, I assume it could get complicated depending on how badly the word is misspelled, because the spell checker treats "_" as part of the word and includes it in its distance calculations. I also wonder whether it is working as you expect, since the shingle filter also produces the shorter, broken-up shingles.
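To illustrate the point about "_" counting toward the distance, here is a small Python sketch using plain Levenshtein edit distance. That choice is an assumption for illustration; the distance measure your spell checker is actually configured with may differ, and `levenshtein` is just a textbook implementation, not Solr code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: insertions,
    deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    indexed = "entree_casual_dining"
    # The shingle joins words with spaces, the index uses underscores.
    print(levenshtein("entre casual dining", indexed))
    # The same misspelling with separators already matching.
    print(levenshtein("entre_casual_dining", indexed))
```

Under this measure every space-versus-underscore mismatch costs one edit, so longer brand names start further away from their indexed form before any real misspelling is counted.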
Upvotes: 0