baldraider

Reputation: 1069

custom tokenizer without using built-in token filters

How do I create a custom tokenizer without using the default built-in token filters? For example, given the text "Samsung Galaxy S9", I want to tokenize it so that it is indexed like this:

["samsung", "galaxy", "s9", "samsung galaxy s9", "samsung s9", "samsung galaxy" , "galaxy s9"].

How would I do that?

Upvotes: 2

Views: 259

Answers (1)

Alkis Kalogeris

Reputation: 17735

PUT testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 20,
          "min_shingle_size": 2,
          "output_unigrams": "true"
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "analyzer": "analyzer_shingle",
          "search_analyzer": "standard", 
          "type": "text"
        }
      }
    }
  }
}

POST testindex/product/1
{
  "title": "Samsung Galaxy S9"
}

GET testindex/_analyze
{
  "analyzer": "analyzer_shingle",
  "text": ["Samsung Galaxy S9"]
}
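To see what the shingle filter actually produces, here is a minimal Python sketch of the analysis chain above (whitespace tokenizer, lowercase filter, then shingles of sizes 2 through `max_shingle_size` plus unigrams). It is an illustration of the token output, not Elasticsearch's internal implementation; note that shingles are built from *contiguous* tokens only, so a non-adjacent combination like "samsung s9" from the question will not be produced:

```python
def analyzer_shingle(text, min_size=2, max_size=20, output_unigrams=True):
    """Approximate whitespace tokenizer + lowercase + shingle filter."""
    # Whitespace tokenizer + lowercase filter
    tokens = text.lower().split()
    # Unigrams first, mirroring output_unigrams: true
    out = list(tokens) if output_unigrams else []
    # Contiguous shingles of each size from min_size to max_size
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

print(analyzer_shingle("Samsung Galaxy S9"))
# ['samsung', 'galaxy', 's9', 'samsung galaxy', 'galaxy s9', 'samsung galaxy s9']
```

The `_analyze` request above returns the same six tokens (with position metadata); "samsung s9" is absent because the tokens are not adjacent in the source text.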

You can find more about shingles here and here.

The first example is great and covers a lot. If you want to use the standard tokenizer rather than the whitespace one, you'll have to take care of stop words, as the blog post describes. Both of the URLs are official Elasticsearch sources.

Upvotes: 2

Related Questions