Deepak N

Reputation: 2571

Elasticsearch: Configuring icu_tokenizer for Czech characters

The icu_tokenizer in Elasticsearch seems to break a word into segments when it encounters accented characters such as Č, and it also returns strange numeric tokens. For example,

GET /_analyze?text=OBČERSTVENÍ&tokenizer=icu_tokenizer

returns

   "tokens": [
      {
         "token": "OB",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "268",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "ERSTVEN",
         "start_offset": 8,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3
      }
   ]
}

I don't know Czech, but a quick Google search suggests that OBČERSTVENÍ is a single word. Is there a way to configure Elasticsearch to handle Czech words properly?

I have tried using icu_normalizer as shown below, but it didn't help:

PUT /my_index_cz
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "char_filter": ["icu_normalizer"],
                    "tokenizer": "icu_tokenizer"                    
                }
            }
        }
    }
}

GET /my_index_cz/_analyze?text=OBČERSTVENÍ&analyzer=my_analyzer

Upvotes: 0

Views: 447

Answers (1)

Deepak N

Reputation: 2571

The issue was that I was using the Elasticsearch Sense plugin to run the query, and it was not encoding the request properly. It worked fine when I wrote a test using the Python client library.
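For reference, a minimal sketch of such a test with the official elasticsearch-py client (the index and analyzer names are the ones from the question; the local host and the body-style call are assumptions, so adjust them to your client version):

# Minimal sketch, assuming the elasticsearch-py client and a local node.
# Index and analyzer names come from the mapping defined in the question.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The client sends the text as UTF-8 in the request body, so the accented
# characters reach the icu_tokenizer intact (unlike the badly encoded URL).
result = es.indices.analyze(
    index="my_index_cz",
    body={"analyzer": "my_analyzer", "text": "OBČERSTVENÍ"},
)

# With correct encoding, the word comes back as a single token.
for token in result["tokens"]:
    print(token["token"])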

Upvotes: 1
