Tao Liu

Reputation: 35

How to build a backward edge n-gram tokenizer

I only see the n-gram and edge n-gram tokenizers, and both of them start from the first letter. I would like to create a tokenizer that builds its tokens from the end of the string instead, producing the following tokens.

For example: 600140 -> 0, 40, 140, 0140, 00140, 600140

Upvotes: 2

Views: 1141

Answers (1)

Val

Reputation: 217274

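You can apply the reverse token filter twice, with an edge_ngram filter in between. The first reverse flips the string, edge_ngram takes prefixes of the flipped string, and the second reverse flips each prefix back, so you end up with suffixes of the original value: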

PUT reverse
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_edgengram": {
          "tokenizer": "keyword",
          "filter": [
            "reverse",
            "edge",
            "reverse"
          ]
        }
      },
      "filter": {
        "edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "string_field": {
        "type": "text",
        "analyzer": "reverse_edgengram"
      }
    }
  }
}

Then you can test it:

POST reverse/_analyze
{
  "analyzer": "reverse_edgengram",
  "text": "600140"
}

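Which yields the tokens below. Note that with min_gram set to 2 the single-character token 0 from your example is not emitted; set min_gram to 1 if you need it as well: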

{
  "tokens" : [
    {
      "token" : "40",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "140",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "0140",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "00140",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "600140",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}
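
If it helps to see why the reverse / edge_ngram / reverse chain produces suffixes, here is a minimal Python sketch that mimics the three filters on a single keyword token. It is only an illustration of the idea, not Elasticsearch code: the function names edge_ngrams and reverse_edge_ngrams are made up for this example, and the defaults simply mirror the min_gram and max_gram values used in the settings above.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Return the prefixes of text whose lengths run from min_gram to max_gram.

    Hypothetical helper that imitates an edge n-gram filter on one token;
    lengths are capped at len(text), so a short token yields fewer grams.
    """
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


def reverse_edge_ngrams(text, min_gram=2, max_gram=25):
    """Imitate the reverse -> edge_ngram -> reverse filter chain."""
    reversed_text = text[::-1]                                 # first "reverse"
    prefixes = edge_ngrams(reversed_text, min_gram, max_gram)  # "edge"
    return [prefix[::-1] for prefix in prefixes]               # second "reverse"


# Same input as the _analyze call above.
print(reverse_edge_ngrams("600140"))
# ['40', '140', '0140', '00140', '600140']

# Lowering min_gram to 1 also gives the single-character suffix.
print(reverse_edge_ngrams("600140", min_gram=1))
# ['0', '40', '140', '0140', '00140', '600140']
```

The second call matches the token list asked for in the question, which is what changing min_gram to 1 in the filter definition should give you.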

Upvotes: 4
