Reputation: 6978

Multi-Language ElasticSearch Support

I am indexing messages from all around the world but mainly Thailand. The indexed messages will most likely contain either English or Thai.

Does anyone know the best way to set the ES index so that it will return good search results for both Thai and English searches?

I've tried using this setting:

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type" : "cjk"
                }
            }
        }
    }
}

The results for the cjk analyser are not great when searching in Thai. I actually don't know why that is but any help would be very much appreciated!

Upvotes: 4

Answers (2)

Bruno dos Santos

Reputation: 1361

The cjk analyzer generates bigrams for Chinese, Japanese and Korean text, but it does not handle Thai. Thai is written without spaces between words, and this analyzer does not segment Thai sentences into words. The recommended analyzer for Thai is the thai analyzer.

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type" : "thai"
                }
            }
        }
    }
}

Another option for analysing Thai text is the ICU Analysis Plugin, which provides the icu_tokenizer. This tokenizer supports Thai, Lao, Chinese, Japanese and Korean.

After installing the plugin, you can use the tokenizer like this:

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer"
                }
            }
        }
    }
}
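
One thing to watch with a custom analyzer like the one above: it only tokenizes, so English terms keep their original case. A minimal sketch, assuming the ICU plugin is installed, that adds the built-in lowercase token filter so English searches become case-insensitive (the filter is an addition for illustration, not part of the setting above):

```json
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": [ "lowercase" ]
                }
            }
        }
    }
}
```

Thai has no letter case, so the lowercase filter should leave Thai tokens unchanged.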

Upvotes: 2

zdk

Reputation: 1576

You could implement a custom thai analyzer as described in: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#thai-analyzer
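
For orientation, recent versions of that documentation page rebuild the thai analyzer from the thai tokenizer plus the lowercase, decimal_digit and a Thai stop filter. The sketch below follows that shape and registers it as the default analyzer. The english_stop filter is an assumption added here for the mixed Thai/English case and is not part of the documented thai analyzer. It also assumes an Elasticsearch version that ships the thai tokenizer and the decimal_digit filter.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "thai",
          "filter": [ "lowercase", "decimal_digit", "thai_stop", "english_stop" ]
        }
      }
    }
  }
}
```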

To make it a bit more useful, you can also add a filter that uses org.apache.lucene.analysis.th.ThaiWordFilterFactory from Apache Lucene, like so:

curl -X PUT http://localhost:9200/test -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "default":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard","thai","lowercase", "stop", "kstem" ]
        }
      },
      "filter": {
        "thai": {
          "type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
        }
      }
    }
  }
}'

Then you can test the index's default analyzer with:

http://localhost:9200/test/_analyze?text=สวัสดี+hello

Hope this helps you.

Upvotes: 1
