Dhiwakar Kusuma
Dhiwakar Kusuma

Reputation: 41

Azure Search - phonetic search implementation

I was trying out Phoenetic search using Azure Search without much luck. My objective is to work out an Index configuration that can handle typos and accomodate phonetic search for end users.

With the below configuration and sample data, I was trying to search for intentionally misspelled words like 'softvare' or 'alek'. I got results for 'alek' thanks for Phonetic analyzer; but didn't get any results for 'softvare'.

Looks like for this requirement phonetic search will not do the trick.

Only option that I found was to use synonyms map. The major pitfall is that I'm unable to use the Phonetics / Custom analyzer along with Synonyms :(

What are the various strategies that you would recommend for taking care of typos?

search query used

  1. ?api-version=2017-11-11&search=alec
  2. ?api-version=2017-11-11&search=softvare

Here is the index configuration

 "name": "phonetichotels",  
 "fields": [
   {"name": "hotelId", "type": "Edm.String", "key":true, "searchable": false},
   {"name": "baseRate", "type": "Edm.Double"},
   {"name": "description", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "analyzer":"my_standard"},
   {"name": "hotelName", "type": "Edm.String", "analyzer":"my_standard"},
   {"name": "category", "type": "Edm.String", "analyzer":"my_standard"},
   {"name": "tags", "type": "Collection(Edm.String)", "analyzer":"my_standard"},
   {"name": "parkingIncluded", "type": "Edm.Boolean"},
   {"name": "smokingAllowed", "type": "Edm.Boolean"},
   {"name": "lastRenovationDate", "type": "Edm.DateTimeOffset"},
   {"name": "rating", "type": "Edm.Int32"},
   {"name": "location", "type": "Edm.GeographyPoint"}
  ],

Analyzer (part of the index creation)

"analyzers":[
    {
      "name":"my_standard",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard_v2",
      "tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
    }
  ]

Analyze API Input and Output for 'software'

{
     "analyzer":"my_standard",
     "text": "software"
  }

{
    "@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
    "tokens": [
        {
            "token": "SFTW",
            "startOffset": 0,
            "endOffset": 8,
            "position": 0
        }
    ]
}

Analyze API Input and Output for 'softvare'

{
     "analyzer":"my_standard",
     "text": "softvare"
  }

{
    "@odata.context": "https://ctsazuresearchpoc.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
    "tokens": [
        {
            "token": "SFTF",
            "startOffset": 0,
            "endOffset": 8,
            "position": 0
        }
    ]
}

Sample data that I loaded

{
         "@search.action": "upload",
         "hotelId": "5",
         "baseRate": 199.0,
         "description": "Best hotel in town for software people",
         "hotelName": "Fancy Stay",
         "category": "Luxury",
         "tags": ["pool", "view", "wifi", "concierge"],
         "parkingIncluded": false,
         "smokingAllowed": false,
         "lastRenovationDate": "2010-06-27T00:00:00Z",
         "rating": 5,
         "location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
       },
{
         "@search.action": "upload",
         "hotelId": "6",
         "baseRate": 79.99,
         "description": "Cheapest hotel in town ",
         "hotelName": " Alec Baldwin Motel",
         "category": "Budget",
         "tags": ["motel", "budget"],
         "parkingIncluded": true,
         "smokingAllowed": true,
         "lastRenovationDate": "1982-04-28T00:00:00Z",
         "rating": 1,
         "location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
       },

With the right configuration, I should have got results even with the misspelled words.

Upvotes: 1

Views: 1019

Answers (2)

Dhiwakar Kusuma
Dhiwakar Kusuma

Reputation: 41

As you could read in my post; my Objective was to handle the typos.

The only easy option is to use the inbuilt Lucene functionality - Fuzzy Search. I'm yet to check on the response times as the querytype has to be set to 'full' for using fuzzy search. Otherwise, the results were satisfactory.

Example: search=softvare~&fuzzy=true&querytype=full will return all documents with the 'Software' in it.

For further reading please go through Documentation

Upvotes: 1

Ishan Srivastava
Ishan Srivastava

Reputation: 620

I work on Azure Search. Before I suggest approaches to handle misspelled words, it would be helpful to look at your custom analyzer (my_standard) configuration. It might tell us why it's not able to handle the case for 'softvare'. As a DIY, you can use the Analyze API to see the tokens created using your custom analyzer and it should contain 'software' to actually match the docs.

Now then, here are a few ways that can be used independently or in conjunction to handle misspelled words. The best approach varies depending on the use-case and I strongly suggest you experiment with these to figure out the best one in your case.

  1. You are already familiar with phonetic filters which is a common approach to handle similarly pronounced terms. If you haven't already, try different encoders for the filter to evaluate which configuration gives you the best results. Check out the list of encoders here.

  2. Use fuzzy queries supported as part of the Lucene query syntax in Azure Search which returns terms that are near the original query term based on a distance metric. The limitation here is that it works on a single term. Check the docs for more details. Sample query would look like - search=softvare~1 You can also use term boosting to give the original term more boost in cases where the original term is also a valid term.

  3. You also alluded to synonyms which is also used to query with misspelled terms. This approach gives you the most control over the process of handling typos but also require you to have prior knowledge of different typos for terms. You can use these docs if you want to experiment with synonyms.

Upvotes: 2

Related Questions