geek

Reputation: 11

Azure search - analyzers

I'm working on a search tool for documents like:

A)
Code: AB-Y X6 8BD
Name: Notebook AZX
Manufacturer: DELL

B)
Code: AB-Y X6 9BD
Name: Notebook 8BD
Manufacturer: DELL

What I need to achieve is a query that matches a document on the Code field only if the user types all the characters included in the Code field. Other fields like Name and Manufacturer are also searchable, but an exact match on the Name field is not needed. What makes this non-trivial is that the user can type the code in different formats: with or without spaces, with or without '-'. Is this achievable with Azure Search? I was thinking about using the keyword analyzer for the Code field together with field-scoped queries, but I don't know where in the search query the user has placed the code.

To give a better picture of what I'm trying to achieve, here are some examples:

- query 'ABYX6 8BD DELL AZX' - returns product A
- query 'ABYX6 DELL AZX' - empty result
- query 'DELL ABYX69BD AZX' - returns product 
- query 'DELL Notebook' - returns product A & B

Upvotes: 1

Views: 1104

Answers (1)

Matthew Gotteiner

Reputation: 405

Your question has 2 parts:

  1. How to normalize the product code (removing spaces and hyphens)
  2. How to apply the right analyzer to the component of the query string to achieve different behavior for different fields (requiring an exact match on the code field)

For product code normalization, you can use a few custom analyzer features:

  1. Mapping char filter. "A char filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string." We'll use this to remove hyphens and spaces from product codes.
  2. Uppercase. "Normalizes token text to upper case." This means users don't have to worry about capitalizing product code letters
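To make the intended effect concrete, here is a rough local simulation of that analysis chain in Python. It is only a sketch of what the mapping char filter and the uppercase token filter should do to a product code, not Azure Search code; the function name normalize_code is made up for illustration.

```python
# Rough local simulation of the analyzer chain; this does not call Azure Search.
# The mappings mirror a char filter that deletes hyphens and spaces.
MAPPINGS = {"-": "", " ": ""}


def normalize_code(text: str) -> str:
    """Apply the char-filter mappings, then uppercase the result."""
    for source, replacement in MAPPINGS.items():
        text = text.replace(source, replacement)
    return text.upper()


if __name__ == "__main__":
    # Different ways a user might type the same product code.
    for raw in ("AB-Y X6 8BD", "abyx6 8bd", "ABYX68BD"):
        print(repr(raw), "->", normalize_code(raw))
```

All three inputs collapse to the same string, which is why the indexed code and the typed code can match regardless of formatting.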

Here's a complete example of an index with these analyzer options set. The example index has an id field, which plays the role of the product code:

{
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenFilters": [
        "uppercase"
      ],
      "charFilters": [
        "hyphen-filter"
      ],
      "name": "id-analyzer",
      "tokenizer": "standard_v2"
    }
  ],
  "charFilters": [
    {
      "mappings": [
        "-=>",
        "\\u0020=>"
      ],
      "name": "hyphen-filter",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter"
    }
  ],
  "name": "index",
  "fields": [
    {
      "key": true,
      "name": "key",
      "type": "Edm.String"
    },
    {
      "analyzer": "id-analyzer",
      "name": "id",
      "type": "Edm.String"
    }
  ]
}

You can find full documentation for the create index call here. Note that you cannot do this through the portal, since it does not support custom analyzers.

Make sure the product code part of the query is inside a phrase, e.g. “ABYX6 8BD DELL AZX”. This way the query parser sends the whole phrase as a single token to the lexical analyzer for processing. You can learn more about that here: How full text search works in Azure Search.

The second question is trickier. If you don't know where in the query string the product code is, the search engine can't know either. Unless fielded search syntax is used, the entire query string is processed for each field independently, using the analyzer configured on that field. This means that if the normalization is set up correctly, for the query “ABYX6 8BD DELL AZX” Azure Search will try to match:

  1. the term ABYX68BDDELLAZX against the Code field
  2. the terms abyx6 8bd dell azx against the other two fields, assuming they use the standard analyzer

The first of these won’t match, because no document has ABYX68BDDELLAZX as its code, so only documents that have dell or azx somewhere in Name or Manufacturer will be returned.
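The per-field behavior can be sketched locally in Python. This is a simplified simulation under stated assumptions: analyze_code_field stands in for the custom analyzer, analyze_standard_field is a crude approximation of the standard analyzer (split on non-alphanumerics, lowercase), and both function names are made up for illustration.

```python
import re
from typing import List


def analyze_code_field(text: str) -> List[str]:
    # Hyphens and spaces are removed before tokenizing, so a single
    # uppercased token is all that remains.
    stripped = text.replace("-", "").replace(" ", "")
    return [stripped.upper()] if stripped else []


def analyze_standard_field(text: str) -> List[str]:
    # Crude stand-in for the standard analyzer (an approximation only).
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


if __name__ == "__main__":
    query = "ABYX6 8BD DELL AZX"
    indexed_code = analyze_code_field("AB-Y X6 8BD")   # product A's code
    query_on_code = analyze_code_field(query)          # whole query as one token
    print("Code field query terms:", query_on_code)
    print("Other fields query terms:", analyze_standard_field(query))
    print("Code field matches:", query_on_code == indexed_code)
```

Because the whole query string collapses into one token for the Code field, it can only equal the indexed code when the query contains nothing but the code.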

I’d recommend modifying the UX of the application so that users enter the product code in its own input, while still allowing some variability in its format. The only other alternative is to treat every query as a free-text query, let the search engine match many results, and rank higher the ones that matched more terms.
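If the code does come from a separate input, the application can put it into a fielded phrase query itself. Below is a minimal Python sketch that builds a request body for the search call using the full Lucene query syntax. The field name Code is taken from the question, the helper name build_search_body is hypothetical, and the free-text part is passed through unescaped, so treat this as a starting point rather than a finished implementation.

```python
import json


def build_search_body(code_input: str, free_text: str) -> dict:
    """Combine a separately entered product code with free-text terms."""
    clauses = []
    code = code_input.strip()
    if code:
        # Quote the code so it reaches the Code field's analyzer as one phrase.
        escaped = code.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'Code:"{escaped}"')
    text = free_text.strip()
    if text:
        # Assumption: free text contains no Lucene special characters.
        clauses.append(f"({text})")
    return {
        "search": " AND ".join(clauses) if clauses else "*",
        "queryType": "full",   # enables fielded search syntax
        "searchMode": "all",   # every term must match
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("ABYX6 8BD", "DELL AZX"), indent=2))
```

With this shape, the code is only ever matched against the Code field, and the remaining terms are matched against the other searchable fields with their own analyzers.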

Please let me know if you have additional questions.

Thanks, Matt

Upvotes: 1
