Prakash Kumar

Reputation: 879

Tokenize a string in Elasticsearch?

In Elasticsearch I want to tokenize words such that if a string ends with 's (an apostrophe followed by s), like flurry's, it is tokenized as flurry's, flurry, and flurrys. But if a string contains any other special character, including an apostrophe that does not end the word in 's, I want my word delimiter to apply as usual, e.g. see below:

S'sode = S, sode, Ssode, S'sode
S-sode = S, sode, Ssode, S-sode

My word delimiter works fine in general; it fails only in the case above, where a string ends with an apostrophe and s. My word delimiter is given below:

"my_word_delimiter" : {
        "type" : "word_delimiter",
        "preserve_original": true,
        "catenate_all": true,
        "split_on_case_change": true,
        "stem_english_possessive": false
 }
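
To see where the unwanted token comes from, the filter can be run directly through the _analyze API with an inline filter definition (a sketch; this form of _analyze assumes Elasticsearch 5.x or later):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_all": true,
      "split_on_case_change": true,
      "stem_english_possessive": false
    }
  ],
  "text": "flurry's"
}

With stem_english_possessive set to false, this should emit flurry's, flurry, s, and flurrys; the lone s is the token I want to avoid.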

I have used the word delimiter filter earlier, but it also produces that single s, which I do not want in my tokenized output. I have also tried a stemmer, but with it I did not get both flurry's and flurrys.

Can anyone please tell me how I can do that? I don't have much experience with Elasticsearch.

So far, with the help of keety's answer combined with my word delimiter, there is only one point I am stuck at: how to tell the word delimiter not to split a string ending with 's. My code is given below:

"settings": {
  "analysis": {
     "char_filter": {
        "test": {
           "type": "pattern_replace",
           "pattern": "\\b((\\w+)'s)\\b",
           "replacement": "$1 $2 $2s"
        }
     },
     "analyzer": {
              "apostrophe_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter" : ["test"],
                    "filter" : [ "my_word_delimiter", "lowercase"]
              }
     },
     "filter":{
            "my_word_delimiter" : {
               "type" : "word_delimiter",
               "preserve_original": true,
               "catenate_all": true,
               "split_on_case_change": true,
               "stem_english_possessive": false
            }
     }
  }

},
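
As a quick check (a sketch, assuming the settings above are applied to an index named test):

GET test/_analyze
{
  "analyzer": "apostrophe_analyzer",
  "text": "flurry's"
}

With stem_english_possessive still false, the stray s from flurry's should still show up, which is exactly where I am stuck.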

Upvotes: 2

Views: 503

Answers (2)

Andrei Stefan

Reputation: 52368

I suggest the following analyzer:

"analysis": {
  "char_filter": {
    "test": {
       "type": "pattern_replace",
       "pattern": "\\b((\\w+)'s)\\b",
       "replacement": "$1 $2 $2s"
    }
 },
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_all": true,
      "split_on_case_change": true,
      "stem_english_possessive": true
    }
  },
  "analyzer": {
    "my_analyzer": {
      "filter": [
        "my_word_delimiter"
      ],
      "char_filter" : ["test"],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
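
The only change from the filter in the question is stem_english_possessive: true, which makes the word delimiter strip a trailing 's instead of splitting it into a separate s token. Combined with the char filter, flurry's should come out as flurry's, flurry, and flurrys with no lone s, while a word like S'sode still splits as before. A quick check (a sketch, assuming these settings are applied to an index named test):

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "flurry's S'sode"
}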

Upvotes: 2

keety

Reputation: 17441

One way to achieve this is using the pattern_replace char filter: the pattern \b((\w+)'s)\b matches a word ending in 's, and the replacement $1 $2 $2s expands it into the original, the bare word, and the word with a plain s appended (so flurry's becomes flurry's flurry flurrys before tokenization).

Example:

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test": {
          "type": "pattern_replace",
          "pattern": "\\b((\\w+)'s)\\b",
          "replacement": "$1 $2 $2s"
        }
      }
    }
  }
}

GET test/_analyze?tokenizer=standard&char_filter=test&text=this is flurry's test

(The query-string form of _analyze works on older Elasticsearch versions; newer versions take the same parameters as a JSON body.)

Result:

 {
   "tokens": [
      {
         "token": "this",
         "start_offset": 0,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "is",
         "start_offset": 5,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "flurry's",
         "start_offset": 8,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "flurry",
         "start_offset": 15,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "flurrys",
         "start_offset": 15,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "test",
         "start_offset": 17,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

Upvotes: 2
