Elasticsearch analyzer configuration

Question

I am running term statistics in Elasticsearch and I get this result:

 "tevez's": {
               "doc_freq": 165,
               "ttf": 245,
               "term_freq": 1,
               "tokens": [
                  {
                     "position": 722,
                     "start_offset": 4077,
                     "end_offset": 4084
                  }
               ],
               "score": 9.041515

How can I tell Elasticsearch to consider tevez's and tevez to be the same?

I also get:

"benched": {
               "doc_freq": 130,
               "ttf": 140,
               "term_freq": 1,
               "tokens": [
                  {
                     "position": 757,
                     "start_offset": 4292,
                     "end_offset": 4299
                  }
               ],
               "score": 9.278306

How can I tell Elasticsearch to consider benched and bench to be the same?

jasonz · Accepted Answer

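Use the possessive_english stemmer to remove the trailing 's.
Use porter or another stemmer to reduce inflected forms such as tenses and plurals (for example, benched to bench).

For English, the Elasticsearch stemmer token filter documentation has the full list of available stemmers.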

You also need to create the index with settings like the following. The possessive filter is listed before the stemmer so that the 's is stripped before stemming:

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "possessive": {
            "type": "stemmer",
            "language": "possessive_english"
          },
          "porter": {
            "type": "stemmer",
            "language": "english"
          }
        },
        "analyzer": {
          "custom_english": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "possessive",
              "porter"
            ]
          }
        }
      }
    }
  }
}
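
Term statistics reflect how the field was analyzed at index time, so the analyzer only takes effect once it is attached to the field and the documents are reindexed. A minimal sketch of that step, assuming Elasticsearch 7 or later (typeless mappings) and a hypothetical text field named body. The helper name is illustrative, and it only builds the request body; it does not talk to a cluster:

```python
import copy
import json


def add_analyzed_field(index_body, field="body", analyzer="custom_english"):
    """Return a copy of an index-creation body with `field` mapped to `analyzer`.

    `field` is a hypothetical field name; replace it with the field you
    actually run term statistics on.
    """
    result = copy.deepcopy(index_body)
    properties = result.setdefault("mappings", {}).setdefault("properties", {})
    properties[field] = {"type": "text", "analyzer": analyzer}
    return result


# Trimmed stand-in for the full analysis settings from the answer.
settings_only = {"settings": {"index": {"analysis": {}}}}

create_index_body = add_analyzed_field(settings_only)
print(json.dumps(create_index_body["mappings"], indent=2))
```

Sending the resulting body with PUT $endpoint/$index creates the index. Documents that were indexed earlier need to be reindexed before their term statistics change.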

Finally, request $endpoint/$index/_analyze?analyzer=custom_english&text=$text to see the effect of the stemming.
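
Recent Elasticsearch versions expect the analyzer and text in a JSON request body rather than in the query string. A minimal sketch using only the Python standard library, assuming a node at http://localhost:9200 and a hypothetical index named my_index. build_analyze_request and extract_tokens are illustrative helpers, not part of any client library:

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: local single-node cluster
INDEX = "my_index"                # hypothetical index name


def build_analyze_request(text, analyzer="custom_english"):
    """Build (but do not send) a POST request to the index's _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Return the token strings from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


request = build_analyze_request("tevez's benched")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen sends it, and extract_tokens can then be applied to the response body. If the possessive filter runs before the stemmer, tevez's and benched should come back as tevez and bench.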
