Reputation: 879
In Elasticsearch I want to tokenize all words such that if a string ends with 's, like flurry's, it is tokenized as flurry's, flurry and flurrys. But if a string contains any other special character, including an apostrophe that is not part of a trailing 's, then I want my word delimiter to apply as usual, e.g.:
S'sode = S, sode, Ssode, S'sode OR S-sode = S, sode, Ssode, S-sode
My word delimiter works fine in general, but not for the case above where a string ends with an apostrophe followed by s. My word delimiter is given below:
"my_word_delimiter" : {
"type" : "word_delimiter",
"preserve_original": true,
"catenate_all": true,
"split_on_case_change": true,
"stem_english_possessive": false
}
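(For reference, the filter's behavior in isolation can be checked with the _analyze API; a minimal sketch, assuming Elasticsearch 5.x or later, where _analyze accepts inline filter definitions:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_all": true,
      "split_on_case_change": true,
      "stem_english_possessive": false
    }
  ],
  "text": "S'sode flurry's"
}

With stem_english_possessive set to false, flurry's comes back as flurry's, flurry, s and flurrys.)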
I have used the word delimiter filter before, but it also emits a lone s, which I do not want among my tokens. I have also tried a stemmer, but with it I did not get flurry's and flurrys.
Can anyone please tell me how I can do this? I don't have much experience with Elasticsearch.
So far, combining Ketty's answer with my word delimiter, the only point I am stuck at is how to tell the word delimiter not to split a string ending with 's. My code is given below:
"settings": {
"analysis": {
"char_filter": {
"test": {
"type": "pattern_replace",
"pattern": "\\b((\\w+)'s)\\b",
"replacement": "$1 $2 $2s"
}
},
"analyzer": {
"apostrophe_analyzer": {
"tokenizer": "whitespace",
"char_filter" : ["test"],
"filter" : [ "my_word_delimiter", "lowercase"]
}
},
"filter":{
"my_word_delimiter" : {
"type" : "word_delimiter",
"preserve_original": true,
"catenate_all": true,
"split_on_case_change": true,
"stem_english_possessive": false
}
}
}
},
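Once an index is created from these settings, the analyzer can be exercised directly (a sketch; test_index is a placeholder for whatever index holds this configuration):

GET test_index/_analyze
{
  "analyzer": "apostrophe_analyzer",
  "text": "flurry's"
}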
Upvotes: 2
Views: 503
Reputation: 52368
I suggest the following analyzer. The key change is stem_english_possessive set to true, so the word delimiter strips the trailing 's itself instead of splitting it off as a lone s token:
"analysis": {
"char_filter": {
"test": {
"type": "pattern_replace",
"pattern": "\\b((\\w+)'s)\\b",
"replacement": "$1 $2 $2s"
}
},
"filter": {
"my_word_delimiter": {
"type": "word_delimiter",
"preserve_original": true,
"catenate_all": true,
"split_on_case_change": true,
"stem_english_possessive": true
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"my_word_delimiter"
],
"char_filter" : ["test"],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
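As a quick check (a sketch; the index name test_index is an assumption), create an index with these settings and run:

GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "this is flurry's test"
}

Because stem_english_possessive is true here, the possessive should come back as flurry's, flurry and flurrys, with no stray s token.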
Upvotes: 2
Reputation: 17441
One way to achieve this is to use the pattern_replace char filter.
Example:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test": {
          "type": "pattern_replace",
          "pattern": "\\b((\\w+)'s)\\b",
          "replacement": "$1 $2 $2s"
        }
      }
    }
  }
}
GET test/_analyze?tokenizer=standard&char_filter=test&text=this is flurry's test
Result:
{
  "tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "flurry's",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "flurry",
      "start_offset": 15,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "flurrys",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "test",
      "start_offset": 17,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
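Note that the injected flurry and flurrys tokens have collapsed offsets (15-15 and 15-16): the pattern_replace char filter maps the extra tokens back to positions in the original text, where only flurry's exists.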
Upvotes: 2