enator

Reputation: 2599

Elasticsearch: Does edgeNGram token filter work on non english tokens?

I am trying to set up a new mapping for an index that is going to support partial keyword search and auto-complete requests powered by ES.

The edgeNGram token filter with the whitespace tokenizer seems like the way to go. So far my settings look something like this:

curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
    "index": {
        "analysis": {
            "analyzer": {
                "customNgram": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "customNgram"]
                }
            },
            "filter": {
                "customNgram": {
                    "type": "edgeNGram",
                    "min_gram": "3",
                    "max_gram": "18",
                    "side": "front"
                }
            }
        }
    }
}
}'

The problem is with Japanese words! Do n-grams work on Japanese text? For example: 【11月13日13時まで、フォロー&RTで応募!】

There is no whitespace in this text, and the document is not searchable with partial keywords. Is that expected?
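For reference, the tokens that customNgram produces for a string like this can be inspected with the _analyze API, for example:

# inspect what the customNgram analyzer emits for the Japanese string
curl -XPOST 'localhost:9200/test_ngram_2/_analyze?pretty' -H 'Content-Type: application/json' -d'{
  "analyzer": "customNgram",
  "text": "【11月13日13時まで、フォロー&RTで応募!】"
}'

With the whitespace tokenizer the whole string comes through as a single token, so the edgeNGram filter only emits prefixes of the entire string (【11, 【11月, ...), never grams starting from words in the middle.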

Upvotes: 3

Views: 1372

Answers (1)

LaserJesus

Reputation: 8540

You might want to look at the icu_tokenizer, which adds support for foreign languages: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

PUT icu_sample

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}

Note that to use it in your index you need to install the appropriate plugin:

bin/elasticsearch-plugin install analysis-icu

Adding this to your code:

curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
    "index": {
        "analysis": {
            "analyzer": {
                "customNgram": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase", "customNgram"]
                }
            },
            "filter": {
                "customNgram": {
                    "type": "edgeNGram",
                    "min_gram": "3",
                    "max_gram": "18",
                    "side": "front"
                }
            }
        }
    }
}
}'

Normally you would query an autocomplete field like this with the standard analyzer. Instead, add another analyzer to your mapping that also uses the icu_tokenizer (but without the edgeNGram filter) and apply it to your query at search time, or explicitly set it as the search_analyzer for the field you apply customNgram to.
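A minimal sketch of that second option, assuming a plain ICU-based analyzer named my_icu_analyzer (icu_tokenizer plus lowercase, as in the icu_sample settings above) is also defined in test_ngram_2, and using doc / title only as placeholder type and field names:

# customNgram (edge n-grams) at index time, plain ICU analysis at search time
curl -XPUT 'localhost:9200/test_ngram_2/_mapping/doc?pretty' -H 'Content-Type: application/json' -d'{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "customNgram",
      "search_analyzer": "my_icu_analyzer"
    }
  }
}'

This way the indexed terms are the edge n-grams, while the query text is only split into whole words by the icu_tokenizer, which is the usual setup for autocomplete fields.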

Upvotes: 3
