Reputation: 602
This is a question for the Azure Cognitive Search team.
I am facing serious issues with advanced search features such as fuzzy search and wildcard search. I am currently using the Standard Lucene analyzer on my indexed field.
The system returns results for the search query 'terminate', and those results contain terminate, termination, terminates, etc. So the results look good. But when I search for 'terminat*' (using the queryType=full parameter, of course), the search returns no results. According to the documentation, a wildcard search should return 'terminate', 'termination', 'terminates' and any other term that starts with 'terminat'.
I have the same problem with fuzzy search. If I search for 'terminate~', I get no results at all.
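For reference, a minimal sketch of the two request bodies described above, assuming they are sent to the REST Search Documents endpoint. The helper name `build_search_body` is hypothetical and only for illustration; `search` and `queryType` are the actual request parameters.

```python
import json


def build_search_body(search_text: str) -> dict:
    """Build a search request body (hypothetical helper, illustration only)."""
    # queryType "full" switches on the full Lucene query syntax,
    # which is required for the wildcard (*) and fuzzy (~) operators.
    return {"search": search_text, "queryType": "full"}


wildcard_body = build_search_body("terminat*")   # prefix/wildcard query
fuzzy_body = build_search_body("terminate~")     # fuzzy query

print(json.dumps(wildcard_body))
print(json.dumps(fuzzy_body))
```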
The situation seems to be better if I use a Microsoft analyzer: fuzzy and wildcard searches then return at least something.
Is this a bug, or is this expected behaviour? Or have I misunderstood the documentation?
Upvotes: 1
Views: 717
Reputation: 990
You got it right that this is due to how the EN.Lucene analyzer tokenizes text. Lucene analyzers apply aggressive stemming to each term. For example, terminate, termination and terminates will all be reduced to the token "termi" in your index. On the other hand, terms in queries that use wildcards or fuzzy search are not analyzed at all.
This means that at indexing time your documents only get the token "termi" in the inverted index, whereas at search time the term "terminat" stays whole (it is not reduced to "termi"). Fuzzy search is limited to an edit distance of 2, and "terminat" is three edits away from "termi", so the two will never match with fuzzy search alone. A wildcard won't help either, because no indexed token starts with "terminat".
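To make the mismatch concrete, here is a rough simulation of the two checks. It is only a sketch: it uses plain Levenshtein distance as a stand-in for the engine's fuzzy matching, takes "termi" as the indexed token from the explanation above, and the function names are made up for illustration.

```python
MAX_FUZZY_EDITS = 2  # fuzzy search allows at most 2 edits


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_matches(query_term: str, indexed_token: str) -> bool:
    """True if the unanalyzed query term is within the fuzzy edit limit."""
    return edit_distance(query_term, indexed_token) <= MAX_FUZZY_EDITS


def prefix_matches(prefix: str, indexed_token: str) -> bool:
    """True if the indexed token starts with the wildcard prefix."""
    return indexed_token.startswith(prefix)


stemmed_index = ["termi"]  # assumed: what the stemming analyzer stored
whole_index = ["terminate", "terminates", "termination"]  # tokens kept whole

for name, index in (("stemmed", stemmed_index), ("whole", whole_index)):
    wildcard_hits = [t for t in index if prefix_matches("terminat", t)]
    fuzzy_hits = [t for t in index if fuzzy_matches("terminate", t)]
    print(name, "wildcard:", wildcard_hits, "fuzzy:", fuzzy_hits)
```

Against the stemmed token neither check finds anything, while against whole-word tokens the prefix check matches all three and the fuzzy check matches those within two edits.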
By contrast, the Microsoft analyzers are a bit more advanced and use lemmatization instead of stemming. This means that all generated tokens should be valid English words. For example, terminate, terminates and termination will mostly stay whole in the index, which makes the Microsoft analyzers a preferable choice for scenarios that depend heavily on wildcards and fuzzy search.
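If you want to check which tokens a given analyzer actually produces, the Analyze Text REST API returns them for any sample text. Below is a minimal sketch that only builds the request and parses a response. The service name, index name and API key are placeholders, the helper names are hypothetical, and the API version is an assumption you may need to adjust.

```python
import json
import urllib.request


def build_analyze_request(service: str, index: str, api_key: str,
                          text: str, analyzer: str,
                          api_version: str = "2020-06-30") -> urllib.request.Request:
    """Build (but do not send) an Analyze Text request."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key})


def tokens_from_response(payload: dict) -> list:
    """Extract the token strings from a parsed Analyze Text response."""
    return [t["token"] for t in payload.get("tokens", [])]


# Placeholders: replace with your own service, index and key.
sample = "terminate terminates termination"
for analyzer in ("en.lucene", "en.microsoft"):
    req = build_analyze_request("my-service", "my-index", "<api-key>",
                                sample, analyzer)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually call the service:
    # with urllib.request.urlopen(req) as resp:
    #     print(analyzer, tokens_from_response(json.load(resp)))
```

Comparing the token lists for the two analyzers should show directly whether a prefix such as "terminat" can match anything in the index.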
Upvotes: 5