Antonin Jelinek

Reputation: 2317

Azure Search - basic search in Czech language

I have an index in the Azure Search service with several string fields marked as searchable and using the Czech Lucene analyzer. Czech uses accented characters, and people commonly replace them with unaccented ones when typing. For example, "Václav" (a name) means the same as "Vaclav". My index has a few documents containing the word "Václav" and none containing "Vaclav".

I expected Azure Search to return all documents containing "Václav" when I search for "Vaclav", but it does not. Do I have to preprocess the query somehow before sending it to the search engine?

I ran my tests both through the Azure Portal (with the API version set to 2015-02-28-Preview) and through my own code using the latest SDK, Microsoft.Azure.Search 1.1.1.

Upvotes: 0

Views: 340

Answers (1)

Yahnoosh

Reputation: 1972

By default, the Lucene and Microsoft analyzers for Czech do not ignore diacritics. The easiest way to get the behavior you want is to use the standardasciifolding.lucene analyzer instead. Alternatively, you could build a custom analyzer that adds the ASCII folding token filter to the standard analysis chain for Czech. For example:

{
  "name":"example",
  "fields":[
    {
      "name":"id",
      "type":"Edm.String",
      "key":true
    },
    {
      "name":"text",
      "type":"Edm.String",
      "searchable":true,
      "retrievable":true,
      "analyzer":"my_czech_analyzer"
    }
  ],
  "analyzers":[
    {
      "name":"my_czech_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard",
      "tokenFilters":[
        "lowercase",
        "czech_stop_filter",
        "czech_stemmer",
        "asciifolding"
      ]
    }
  ],
  "tokenFilters":[
    {
      "name":"czech_stop_filter",
      "@odata.type":"#Microsoft.Azure.Search.StopTokenFilter",
      "stopwords_list":"_czech_"
    },
    {
      "name":"czech_stemmer",
      "@odata.type":"#Microsoft.Azure.Search.StemmerTokenFilter",
      "language":"czech"
    }
  ]
}
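
To illustrate what the asciifolding step contributes, here is a minimal Python sketch. It is not how Azure Search or Lucene implements folding: `fold_to_ascii` and `analyze` are hypothetical helpers that approximate it with Unicode decomposition, and a whitespace split stands in for the standard tokenizer. Stop-word removal and stemming are left out. The point is only that the same folding runs on both the indexed text and the query text, so the two end up with matching tokens.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate ASCII folding: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Toy analysis chain (assumption): lowercase, whitespace split, fold."""
    return [fold_to_ascii(token) for token in text.lower().split()]


def matches(document_text, query_text):
    """True if every analyzed query token also appears in the analyzed document."""
    return set(analyze(query_text)) <= set(analyze(document_text))


if __name__ == "__main__":
    # The document contains the accented form, the query the unaccented one.
    print(analyze("Václav Havel"))
    print(matches("Václav Havel", "Vaclav"))
```

Folding only the query on the client side would most likely not help, because the index would still hold the accented token. Setting the analyzer on the field applies it at both indexing and query time, which is why the index definition needs to change and existing documents need to be reindexed.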

We realize that the experience is not optimal now. We’re working to make customizations like this easier.

Let me know if this answers your question.

Upvotes: 1
