Prakash Kumar

Reputation: 879

Tokenize a string in Elasticsearch?

In Elasticsearch I want to tokenize words such that if a string ends with 's (an apostrophe followed by s), like flurry's, it is tokenized as flurry's, flurry, and flurrys. But if a string contains any other special character, including an apostrophe that does not end the word in 's, I want my word delimiter to apply as usual, e.g. see below:

S'sode = S, sode, Ssode, S'sode
S-sode = S, sode, Ssode, S-sode

My word delimiter works fine in general; it fails only in the case above, where a string ends with an apostrophe and s. My word delimiter is given below:

"my_word_delimiter" : {
        "type" : "word_delimiter",
        "preserve_original": true,
        "catenate_all": true,
        "split_on_case_change": true,
        "stem_english_possessive": false
 }
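
To see where the unwanted token comes from, the filter can be run directly through the _analyze API with an inline filter definition (a sketch; this form of _analyze assumes Elasticsearch 5.x or later):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_all": true,
      "split_on_case_change": true,
      "stem_english_possessive": false
    }
  ],
  "text": "flurry's"
}

With stem_english_possessive set to false, this should emit flurry's, flurry, s, and flurrys; the lone s is the token I want to avoid.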

I have used the word delimiter filter earlier, but it also produces that single s, which I do not want in my tokenized output. I have also tried a stemmer, but with it I did not get both flurry's and flurrys.

Can anyone please tell me how I can do that? I don't have much experience with Elasticsearch.

So far, with the help of keety's answer combined with my word delimiter, there is only one point I am stuck at: how to tell the word delimiter not to split a string ending with 's. My code is given below:

"settings": {
  "analysis": {
     "char_filter": {
        "test": {
           "type": "pattern_replace",
           "pattern": "\\b((\\w+)'s)\\b",
           "replacement": "$1 $2 $2s"
        }
     },
     "analyzer": {
              "apostrophe_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter" : ["test"],
                    "filter" : [ "my_word_delimiter", "lowercase"]
              }
     },
     "filter":{
            "my_word_delimiter" : {
               "type" : "word_delimiter",
               "preserve_original": true,
               "catenate_all": true,
               "split_on_case_change": true,
               "stem_english_possessive": false
            }
     }
  }

},
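
As a quick check (a sketch, assuming the settings above are applied to an index named test):

GET test/_analyze
{
  "analyzer": "apostrophe_analyzer",
  "text": "flurry's"
}

With stem_english_possessive still false, the stray s from flurry's should still show up, which is exactly where I am stuck.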

Upvotes: 2

Views: 503

Answers (2)

Andrei Stefan

Reputation: 52368

I suggest the following analyzer:

"analysis": {
  "char_filter": {
    "test": {
       "type": "pattern_replace",
       "pattern": "\\b((\\w+)'s)\\b",
       "replacement": "$1 $2 $2s"
    }
 },
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_all": true,
      "split_on_case_change": true,
      "stem_english_possessive": true
    }
  },
  "analyzer": {
    "my_analyzer": {
      "filter": [
        "my_word_delimiter"
      ],
      "char_filter" : ["test"],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
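
The only change from the filter in the question is stem_english_possessive: true, which makes the word delimiter strip a trailing 's instead of splitting it into a separate s token. Combined with the char filter, flurry's should come out as flurry's, flurry, and flurrys with no lone s, while a word like S'sode still splits as before. A quick check (a sketch, assuming these settings are applied to an index named test):

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "flurry's S'sode"
}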

Upvotes: 2

keety

Reputation: 17441

One way to achieve this is using the pattern_replace char filter: the pattern \b((\w+)'s)\b matches a word ending in 's, and the replacement $1 $2 $2s expands it into the original, the bare word, and the word with a plain s appended (so flurry's becomes flurry's flurry flurrys before tokenization).

Example:

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test": {
          "type": "pattern_replace",
          "pattern": "\\b((\\w+)'s)\\b",
          "replacement": "$1 $2 $2s"
        }
      }
    }
  }
}

GET test/_analyze?tokenizer=standard&char_filter=test&text=this is flurry's test

(The query-string form of _analyze works on older Elasticsearch versions; newer versions take the same parameters as a JSON body.)

Result:

 {
   "tokens": [
      {
         "token": "this",
         "start_offset": 0,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "is",
         "start_offset": 5,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "flurry's",
         "start_offset": 8,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "flurry",
         "start_offset": 15,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "flurrys",
         "start_offset": 15,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "test",
         "start_offset": 17,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

Upvotes: 2
