Rob

Reputation: 877

Exclude certain tokens from Elasticsearch's lowercase filter

I'd like all words to be indexed as lowercased tokens, except for a select few. I thought I could accomplish this using the condition token filter in combination with the lowercase filter:

Based on my reading of this page in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-condition-tokenfilter.html

I added this filter to exempt the word "WHO":

{
   "filter":{
      "smart_lowercase_filter":{
         "filter":[
            "lowercase"
         ],
         "type":"condition",
         "script":{
            "source":"token.term != 'WHO'"
         }
      }
   }
}

However, "WHO" still gets tokenized as "who". Any idea what I'm doing wrong?

Many thanks.

Upvotes: 0

Views: 287

Answers (1)

Val

Reputation: 217304

You need to call the CharSequence.toString() method; otherwise you're comparing a CharSequence to a String, which are never equal, so the condition is always true and the lowercase filter runs on every token.

{
  "settings": {
    "analysis": {
      "filter": {
        "smart_lowercase_filter": {
          "filter": [
            "lowercase"
          ],
          "type": "condition",
          "script": {
            "source": "token.term.toString() != 'WHO'"
                                     ^
                                     |
                                  add this
          }
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "smart_lowercase_filter"
          ]
        }
      }
    }
  }
}
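To verify, you can run the analyzer against a sample sentence with the _analyze API (assuming the settings above were applied when creating an index named my_index; the index name is just an example):

```json
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "hey WHO are you"
}
```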

And you'll get this:

{
  "tokens" : [
    {
      "token" : "hey",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "WHO",                  <------------
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "are",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}

Upvotes: 3
