Reputation: 116
I am using PorterStemFilterFactory on a field that contains 3 to 4 words.
E.g.: "ABC BLOSSOM COMPANY"
I expect to fetch the above document when I search for ABC BLOSSOMING COMPANY as well.
When I query this:
name:ABC AND name:BLOSSOMING AND name:COMPANY
I get my result.
This is what the parsed query looks like:
+name:southern +name:blossom +name:compani (Stemmer works fine)
But when I add the fuzzy syntax and query like this:
name:ABC~1 AND name:BLOSSOMING~1 AND name:COMPANY~1
the search returns no documents, and the parsed query looks like this:
+name:abc~1 +name:blossoming~1 +name:company~2
This shows that the fuzzy terms are not being stemmed. Please review and give feedback.
Upvotes: 5
Views: 818
Reputation: 560
Well, here's the configuration that somewhat worked for me while experimenting:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
(Yes, I modified the existing "text_general" field type; I said I was experimenting.)
Using it with a fuzzy edit distance of 2, it produced the following results for the term "neglect":
1. Lost in Translation - A faded movie star and a neglected young woman...
2. Election - A high school teacher meets his match in an over-achieving...
3. Annie Hall - Alvy Singer, a divorced Jewish comedian, reflects on his relationship...
That is somewhat good, because the first result is relevant.
Yet, if I search for "rescuing" with fuzzy search enabled, it produces nothing. And if fuzzy is disabled, the results are:
1. The Searchers - ... a years-long journey to rescue his niece from ...
2. Star Wars - ...while also attempting to rescue Princess Leia from...
So the results of fuzzy + stemming are fairly inconsistent. Elasticsearch, which is Lucene-based like Solr, doesn't recommend using fuzzy with stemming:
This also means that if using say, a snowball analyzer, a fuzzy search for 'running', will be stemmed to 'run', but will not match the misspelled word 'runninga', which stems to 'runninga', because 'run' is more than 2 edits away from 'runninga'. This can cause quite a bit of confusion, and for this reason, it often makes sense only to use the simple analyzer on text intended for use with fuzzy queries, possibly disabling synonyms as well.
Source: https://www.elastic.co/blog/found-fuzzy-search
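To make the arithmetic in that quote concrete, here is a small Python sketch (my own illustration, not Solr or Elasticsearch code). It assumes plain Levenshtein distance; Lucene's fuzzy matching also counts a swap of two adjacent characters as one edit by default, which makes no difference for this pair. The helper name levenshtein is made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


stemmed_query = "run"      # what the quote says 'running' is stemmed to
indexed_term = "runninga"  # the misspelling, left unchanged by the stemmer

distance = levenshtein(stemmed_query, indexed_term)
within_fuzzy_limit = distance <= 2  # 2 is the maximum edit distance in the quote

print(distance, within_fuzzy_limit)  # 5 False
```

The misspelled term is five insertions away from the stem, so a fuzzy query on the stemmed term cannot reach it.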
Upvotes: 0
Reputation: 9500
TL;DR
Stemming is not happening because you are using the PorterStemFilterFactory, which is not a MultiTermAwareComponent.
What To Do?
Use one of the Filters/Normalizers that implements the MultiTermAwareComponent interface.
Explanation
You, like many others, have been caught by Solr's and Lucene's multiterm behaviour. There is a good article about this topic on the Solr wiki. Although the article is dated, it still holds true:
One of the surprises for most Solr users is that wildcards queries haven't gone through any analysis. Practically, this means that wildcard (and prefix and range) queries are case sensitive, which is at odds with expectations. As of this SOLR-2438, SOLR-2918, and perhaps SOLR-2921, this behavior is changed.
What's a multiterm you ask? Essentially it's any term that may "point to" more than one real term. For instance, run* could expand to runs, runner, running, runt, etc. Likewise, a range query is really a "multiterm" query as well. Before Solr 3.6, these were completely unprocessed, the application layer usually had to apply any transformations required, for instance lower-casing the input. Running these types of terms through a "normal" query analysis chain leads to all sorts of interesting behavior so was avoided.
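Applied to the question, a rough Python sketch (an illustration only, not how Solr is implemented) shows why the fuzzy query finds nothing. The assumptions are mine: the index holds the stemmed terms abc, blossom and compani for "ABC BLOSSOM COMPANY", the fuzzy clauses are the lowercased but unstemmed terms from the second parsed query, and plain Levenshtein distance stands in for Lucene's edit distance. The names levenshtein and clause_matches are hypothetical helpers.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Assumed index content: the stemmed terms for "ABC BLOSSOM COMPANY".
indexed_terms = {"abc", "blossom", "compani"}

# Fuzzy clauses as in the parsed query: lowercased, not stemmed, with max edits.
fuzzy_clauses = [("abc", 1), ("blossoming", 1), ("company", 2)]


def clause_matches(term: str, max_edits: int) -> bool:
    """True if any indexed term lies within max_edits of the query term."""
    return any(levenshtein(term, t) <= max_edits for t in indexed_terms)


results = {term: clause_matches(term, k) for term, k in fuzzy_clauses}
document_matches = all(results.values())  # the clauses are ANDed together

print(results)           # {'abc': True, 'blossoming': False, 'company': True}
print(document_matches)  # False
```

Only the blossoming clause fails: it is three edits away from the indexed stem blossom, and because the clauses are combined with AND, the whole query returns nothing.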
Upvotes: 4