TAugusti

Reputation: 71

ignore_case doesn't work for Elasticsearch stop token filter

I'm trying to test the stop token filter so that it removes stop words case-insensitively. I tried the example from Elasticsearch's documentation as-is, but it doesn't work. Is the documentation wrong, or am I doing something wrong? https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-stop-tokenfilter.html

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}

Then I do

GET my-index-000001/_analyze
{
  "field": "ASCII_FIELD", 
  "text" :"this that a b The is IS was açaí à la carte"
}
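
The same analysis can also be requested by naming the analyzer instead of the field (a minimal sketch, assuming the custom analyzer defined above is registered under the name "default"):

GET my-index-000001/_analyze
{
  "analyzer": "default",
  "text": "this that a b The is IS was açaí à la carte"
}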

I wouldn't expect either "The" or "IS" to show up among the tokens, but they are present. The lower-case stop words do seem to be removed. Next I add a document like this:

PUT my-index-000001/_doc/1
{    
  "ASCII_FIELD" :"this that a b The is IS was  açaí à la carte"
}
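
To see which terms actually ended up in the index for that document, the term vectors API can be used (a minimal sketch, assuming document id 1 from above):

GET my-index-000001/_termvectors/1?fields=ASCII_FIELD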

Then I search like below. I shouldn't get a hit, but the document comes back in the results:

GET my-index-000001/_search
{
  "query": {
    "match": {
      "ASCII_FIELD": "The"
    }
  }
}

Upvotes: 0

Views: 198

Answers (1)

rabbitbr

Reputation: 3261

Your text contains "The"; look at the documentation.

Documentation:

When not customized, the filter removes the following English stop words by default:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

You have two options:

Add the lowercase filter:

 "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_custom_stop_words_filter"
          ]
        }
      }

OR

Add "stopwords": "_english_" to your filter:

  "my_custom_stop_words_filter": {
          "type": "stop",
          "stopwords": "_english_",
          "ignore_case": true
        }
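
Either way, the analyzer of an existing index can't simply be changed in place, so you need to recreate the index (or close it, update the settings, and reopen it). A full settings body using the second option could look like this (a sketch, keeping your analyzer and filter names):

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "stopwords": "_english_",
          "ignore_case": true
        }
      }
    }
  }
}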

Test:

GET my-index-000001/_analyze
{
  "field": "ASCII_FIELD", 
  "text" :"this that a b The is IS was açaí à la carte"
}

Tokens:

{
  "tokens": [
    {
      "token": "b",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "açaí",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 8
    },
    {
      "token": "à",
      "start_offset": 33,
      "end_offset": 34,
      "type": "word",
      "position": 9
    },
    {
      "token": "la",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 10
    },
    {
      "token": "carte",
      "start_offset": 38,
      "end_offset": 43,
      "type": "word",
      "position": 11
    }
  ]
}
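
With either change in place, the match query from the question should no longer return the document, because "The" is dropped at both index time and search time (a sketch of the check, assuming the document is reindexed after the settings change):

GET my-index-000001/_search
{
  "query": {
    "match": {
      "ASCII_FIELD": "The"
    }
  }
}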

Upvotes: 1
