Reputation: 2571
The icu_tokenizer in Elasticsearch seems to break a word into segments when it encounters accented characters such as Č, and it also returns strange numeric tokens. Example:
GET /_analyze?text=OBČERSTVENÍ&tokenizer=icu_tokenizer
returns
"tokens": [
{
"token": "OB",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "268",
"start_offset": 4,
"end_offset": 7,
"type": "<NUM>",
"position": 2
},
{
"token": "ERSTVEN",
"start_offset": 8,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 3
}
]
}
I don't know Czech, but a quick Google search suggests OBČERSTVENÍ is a single word. Is there a way to configure Elasticsearch to work properly for Czech words?
I have tried using the icu_normalizer as below, but it didn't help:
PUT /my_index_cz
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["icu_normalizer"],
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}
GET /my_index_cz/_analyze?text=OBČERSTVENÍ&analyzer=my_analyzer
Upvotes: 0
Views: 447
Reputation: 2571
The issue was that I was using the Elasticsearch Sense plugin to query this, and it was not encoding the data properly (Č is U+010C, i.e. decimal 268, which is presumably where the stray numeric token came from). It worked fine when I wrote a test using the Python client library.
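For reference, a minimal sketch of such a test with the official Python client (elasticsearch-py); the host URL and the exact call signature vary by client version and are assumptions here, but the point is that the client sends the text UTF-8 encoded in the request body rather than in the URL:

from elasticsearch import Elasticsearch

# Host URL is an assumption; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

# Run the analyzer defined on my_index_cz against the Czech word.
# The client serializes the body as UTF-8 JSON, so Č and Í arrive intact.
result = es.indices.analyze(
    index="my_index_cz",
    body={"analyzer": "my_analyzer", "text": "OBČERSTVENÍ"},
)

for token in result["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])

With the icu_normalizer char filter plus icu_tokenizer from the question, this should print a single token covering the whole word (case-folded to something like "občerstvení") instead of the OB / 268 / ERSTVEN fragments.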
Upvotes: 1