eugene

Reputation: 41745

Elasticsearch: How to tokenize on whitespace and a special word

In Korean, a city name can have a suffix attached to it.

It's like Newyorkcity

People use either Newyork or Newyorkcity

I'd like to create analyzers (index/search) so that when people search for either newyork or newyorkcity, I can return all the newyork-related documents.

I was looking at the pattern tokenizer and thought I could make this work with

"tokenizer": ["whitespace", "my_pattern_tokenizer"]

But then I found out you can have only one tokenizer in an analyzer.

How can I achieve what I want?

Upvotes: 1

Views: 340

Answers (2)

Addicted

Reputation: 749

One option is an ngram analyzer, so that partial names such as Newyork match the indexed Newyorkcity:

PUT index_name
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "token_chars": ["letter", "digit"],
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  }
}
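
To sanity-check the analyzer, you can run a sample value through the _analyze API (assuming the index above was created as index_name). Among the returned grams you should see Newyork itself, which is why the plain match query below also finds the suffixed form:

GET index_name/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Newyorkcity"
}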

Search for Newyork or Newyorkcity:

GET index_name/_search
{
    "query": {
      "match": {
        "city": "Newyork"
      } 
    }
}

GET index_name/_search
{
    "query": {
      "bool": {
        "should": [
          { 
            "match": {
            "city": "Newyorkcity"
            }
          },
          { 
            "match": {
            "city.raw": "Newyorkcity"
            }
          }
        ]
      } 
    }
}
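
The second should clause queries the city.raw keyword subfield, so an exact, untokenized value still matches (and boosts the score) when the full name is supplied.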

Upvotes: 0

Tom Slabbaert
Tom Slabbaert

Reputation: 22316

I don't recommend using an ngram analyzer: the results can be unstable, and the extra grams add massive data redundancy to the index.

Your idea is on the right track; here is how I would do it.

Start by creating a custom analyzer that uses a pattern_replace char filter:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                    "char_filter": ["my_city_char_filter"]
                }
            },
            "char_filter": {
                "my_city_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "city",
                    "replacement": ""
                }
            }
        }
    }
}
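
To verify the analysis chain (assuming these settings were applied to an index named index), Newyorkcity should come back as the single token newyork: the char filter strips the suffix before the whitespace tokenizer runs, and the lowercase filter then normalizes the case.

POST index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Newyorkcity"
}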

In your mapping (the field must be text, not keyword, since keyword fields do not accept an analyzer):

"city": {
    "type": "text",
    "analyzer": "my_analyzer"
}

Now your data is ready to be queried with a simple match (shown here with Newyorkcity as the search input):

GET index/_search
{
    "query": {
        "match": {
            "city": "Newyorkcity"
        }
    }
}
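
Because my_analyzer runs at both index and search time, Newyork and Newyorkcity both reduce to the token newyork, so either search input returns the same documents.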

Upvotes: 1
