Reputation: 877
I'd like all words to be indexed as lowercased tokens, except for a select few. I thought I could accomplish this using the condition token filter in combination with the lowercase filter.
Based on my reading of this page in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-condition-tokenfilter.html
I added this filter to exempt the word "WHO":
{
  "filter": {
    "smart_lowercase_filter": {
      "filter": [
        "lowercase"
      ],
      "type": "condition",
      "script": {
        "source": "token.term != 'WHO'"
      }
    }
  }
}
However, "WHO" still gets tokenized as "who". Any idea what I'm doing wrong?
Many thanks.
Upvotes: 0
Views: 287
Reputation: 217304
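token.term is a CharSequence, not a String, so comparing it directly with 'WHO' never reports the two as equal. The condition is therefore always true and the lowercase filter is applied to every token, including "WHO". Call CharSequence.toString() on the term before comparing: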
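To see why the comparison fails, here is a minimal plain-Java sketch of the same pitfall. It assumes the script's != follows equals() semantics, and it uses a StringBuilder as a stand-in for the token's term (the real object is a Lucene term attribute, not a StringBuilder); the class and method names are made up for illustration.

```java
public class TermCompareSketch {

    // Mirrors the original condition: the CharSequence itself is compared with a String.
    // A StringBuilder (our stand-in) never equals a String, so this is always true.
    static boolean lowercaseWithoutToString(CharSequence term) {
        return !term.equals("WHO");
    }

    // Mirrors the fixed condition: convert to String first, then compare contents.
    static boolean lowercaseWithToString(CharSequence term) {
        return !term.toString().equals("WHO");
    }

    public static void main(String[] args) {
        CharSequence term = new StringBuilder("WHO"); // hypothetical stand-in for token.term

        // true: "WHO" would be lowercased, which is the behaviour seen in the question
        System.out.println(lowercaseWithoutToString(term));

        // false: "WHO" is exempted from the lowercase filter
        System.out.println(lowercaseWithToString(term));
    }
}
```

The first check returns true for every term, which matches the symptom in the question: the wrapped lowercase filter runs on all tokens.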
{
  "settings": {
    "analysis": {
      "filter": {
        "smart_lowercase_filter": {
          "filter": [
            "lowercase"
          ],
          "type": "condition",
          "script": {
            "source": "token.term.toString() != 'WHO'"
          }
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "smart_lowercase_filter"
          ]
        }
      }
    }
  }
}
Analyzing the text "hey WHO are you" with my_analyzer then returns the following, with "WHO" left in uppercase:
{
  "tokens" : [
    {
      "token" : "hey",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "WHO", <------------
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "are",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}
Upvotes: 3