http203

Reputation: 841

Strange behavior using the wildcard character for certain keywords

In my Azure Cognitive Search index, searching for the term "education" returns 660 hits, and searching for "educational" also returns 660 hits. Both searches appear to return the same documents, which contain both variations of the word.

However, I am seeing very strange behavior when using the wildcard character:

edu* returns 660 results (expected)
educ* returns 660 results (expected)
educa* returns 2 results (matches two instances of the hyphenated word "educa-tion")
educat* returns 0 results (unexpected)
educati* returns 0 results (unexpected)
educatio* returns 0 results (unexpected)

Every search field uses the English Lucene language analyzer, queryType is set to "full", and searchMode is set to "all".

Why don't the last three wildcard queries return any results?
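For reference, the queries are issued roughly like this. The sketch below uses the Python azure-search-documents SDK; the service name, index name, and key are placeholders, and the important parts are queryType=full and searchMode=all as described above.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholders: replace with your own service endpoint, index name, and query key.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<query-key>"),
)

# Same settings as described above: full Lucene syntax and searchMode=all.
for term in ["edu*", "educ*", "educa*", "educat*", "educati*", "educatio*"]:
    results = client.search(
        search_text=term,
        query_type="full",
        search_mode="all",
        include_total_count=True,
    )
    print(term, results.get_count())
```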

As an aside, I found conflicting information about using the wildcard character at the beginning of a word.

The Lucene documentation says:

Note: You cannot use a * or ? symbol as the first character of a search.

From: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

But on Microsoft's site, they seem to imply that it should work:

Term fragment comes after * or ?, with a forward slash to delimit the construct. For example, search=/.*numeric./ returns "alphanumeric".

From: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_wildcard

I've tried *ducation (which returns an error) and /.*ducation./ (which returns 0 results).

Thank you for your help.

Upvotes: 0

Views: 146

Answers (1)

Dan Gøran Lunde

Reputation: 5353

When you use the English Lucene analyzer, your content is stemmed aggressively. This is explained in the link you provided, in the section "Impact of an analyzer on wildcard queries". If you change to the Microsoft English analyzer, your example should work as expected.

https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#impact-of-an-analyzer-on-wildcard-queries

If you were to use the en.lucene (English Lucene) analyzer, it would apply aggressive stemming of each term. For example, 'terminate', 'termination', and 'terminates' will all be tokenized down to the token 'termi' in your index. On the other hand, terms in queries using wildcards or fuzzy search are not analyzed at all, so there would be no results matching the 'terminat*' query.

By contrast, the Microsoft analyzers (in this case, the en.microsoft analyzer) are a bit more advanced and use lemmatization instead of stemming. This means that all generated tokens should be valid English words. For example, 'terminate', 'terminates', and 'termination' will mostly stay whole in the index, which makes these analyzers a preferable choice for scenarios that rely heavily on wildcard and fuzzy search.
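Here is a rough sketch of both steps using the Python azure-search-documents SDK; the endpoint, admin key, index name, and field names are placeholders. Step 1 uses the Analyze Text API to show what en.lucene actually stores for "education" (the exact stemmed token may vary, but it is shorter than "educat", which is why educat* matches nothing). Step 2 defines the searchable field with the en.microsoft analyzer instead. Note that a field's analyzer can't be changed in place, so you'll need to rebuild the index and re-index your content.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    AnalyzeTextOptions,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
)

# Placeholders: replace with your own service endpoint and admin key.
index_client = SearchIndexClient(
    endpoint="https://<service-name>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

# 1) Inspect what en.lucene actually stores for the terms.
#    The stemmed tokens will be shorter than "educat".
result = index_client.analyze_text(
    "<index-name>",
    AnalyzeTextOptions(text="education educational", analyzer_name="en.lucene"),
)
print([t.token for t in result.tokens])

# 2) Recreate the index with en.microsoft on the searchable field(s),
#    then re-index the content.
index = SearchIndex(
    name="<index-name>",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(
            name="content",
            type=SearchFieldDataType.String,
            analyzer_name="en.microsoft",
        ),
    ],
)
index_client.create_or_update_index(index)
```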

Upvotes: 1
