Reputation: 6978
I am indexing messages from all around the world but mainly Thailand. The indexed messages will most likely contain either English or Thai.
Does anyone know the best way to set the ES index so that it will return good search results for both Thai and English searches?
I've tried using this setting:
{
"settings": {
"analysis" : {
"analyzer" : {
"default" : {
"type" : "cjk"
}
}
}
}
}
The results for the cjk analyser are not great when searching in Thai. I actually don't know why that is but any help would be very much appreciated!
Upvotes: 4
Views: 2105
Reputation: 1361
The cjk analyzer generates bigrams for Chinese, Japanese and Korean text, but not for Thai. Because Thai is written without spaces between words, this analyzer does not split a Thai sentence into words. The recommended analyzer for the Thai language is the thai analyzer:
{
"settings": {
"analysis" : {
"analyzer" : {
"default" : {
"type" : "thai"
}
}
}
}
}
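Since your messages mix Thai and English, it can help to see roughly what the thai analyzer is made of, so that you can extend it. Below is a minimal sketch that builds a similar custom analyzer as a Python dict and prints it as JSON for use as the index-creation body. The analyzer name thai_english and the helper build_settings are made up for illustration, and the optional porter_stem filter for English words is my own assumption rather than part of the built-in analyzer, so check it against the Elasticsearch version you run.

```python
import json


def build_settings(stem_english=False):
    """Build index settings for a custom Thai analyzer (hypothetical helper)."""
    # Token filters applied after the "thai" tokenizer has split the text.
    filters = ["lowercase", "thai_stop"]
    if stem_english:
        # Assumption: add English stemming on top of the Thai pipeline.
        filters.append("porter_stem")
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Stop filter using the predefined Thai stopword list.
                    "thai_stop": {"type": "stop", "stopwords": "_thai_"}
                },
                "analyzer": {
                    # "thai_english" is an illustrative name, not a built-in.
                    "thai_english": {
                        "type": "custom",
                        "tokenizer": "thai",
                        "filter": filters,
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_settings(stem_english=True), indent=2))
```

To make it the default for the index, rename thai_english to default in the dict before sending it.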
Another option for analysing Thai data is the ICU Analysis Plugin, which provides the icu_tokenizer. This tokenizer supports the Thai, Lao, Chinese, Japanese and Korean languages. You can find the plugin by this link: ICU Analysis Plugin
After installing the plugin, you can use the tokenizer this way:
{
"settings": {
"analysis" : {
"analyzer" : {
"default" : {
"type": "custom",
"tokenizer": "icu_tokenizer"
}
}
}
}
}
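To compare how the two tokenizers split a mixed Thai and English string, you can call the _analyze API with an explicit tokenizer. This is only a sketch: it assumes an Elasticsearch version whose _analyze endpoint accepts a JSON body, and a node at http://localhost:9200. The helper analyze_request is hypothetical, and it only builds the request without sending it.

```python
import json
import urllib.request


def analyze_request(tokenizer, text, base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request for one tokenizer."""
    body = json.dumps({"tokenizer": tokenizer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    for name in ("thai", "icu_tokenizer"):
        req = analyze_request(name, "สวัสดี hello")
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # To actually send it (needs a running node):
        # print(urllib.request.urlopen(req).read().decode("utf-8"))
```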
Upvotes: 2
Reputation: 1576
You could implement a custom thai analyzer as described in: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#thai-analyzer
To make it a bit more useful, you can also add a new filter that uses org.apache.lucene.analysis.th.ThaiWordFilterFactory from Apache Lucene, like so:
curl -X PUT http://localhost:9200/test -d '{
"settings":{
"analysis":{
"analyzer":{
"default":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard","thai","lowercase", "stop", "kstem" ]
}
},
"filter": {
"thai": {
"type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
}
}
}
}
}'
Then, you could test with:
http://localhost:9200/test/_analyze?analyzer=thai&text=สวัสดี+hello
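If you call that URL from a script, the Thai text has to be percent-encoded. Here is a small sketch using only the Python standard library. The helper analyze_url is hypothetical, and the query-string form of _analyze is simply taken from the URL above, so newer Elasticsearch versions may not accept it.

```python
from urllib.parse import urlencode


def analyze_url(index, analyzer, text, base_url="http://localhost:9200"):
    """Build an _analyze URL with the text safely percent-encoded as UTF-8."""
    # urlencode escapes non-ASCII characters and turns spaces into "+".
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url}/{index}/_analyze?{query}"


if __name__ == "__main__":
    print(analyze_url("test", "thai", "สวัสดี hello"))
```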
Hope this helps you.
Upvotes: 1