Reputation: 71
I'm trying to test a stop token filter that removes stop words case-insensitively. I tried the example from Elasticsearch's documentation as is, but it doesn't work. Is the documentation wrong, or am I doing something wrong? https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-stop-tokenfilter.html
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}
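(To rule out a typo in the settings, the filter registration can be checked with:)
GET /my-index-000001/_settings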
Then I do:
GET my-index-000001/_analyze
{
  "field": "ASCII_FIELD",
  "text": "this that a b The is IS was açaí à la carte"
}
I wouldn't expect either "The" or "IS" among the tokens, but both are present. The filter only seems to remove the lower-case stop words. I then add a document like this:
PUT my-index-000001/_doc/1
{
  "ASCII_FIELD": "this that a b The is IS was açaí à la carte"
}
I search as below; I shouldn't get a hit, but the document comes back:
GET my-index-000001/_search
{
  "query": {
    "match": {
      "ASCII_FIELD": "The"
    }
  }
}
Upvotes: 0
Views: 198
Reputation: 3261
Your text contains "The", and "the" is in the default stop word list. From the documentation:
When not customized, the filter removes the following English stop words by default:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
You have two options:
Add the lowercase filter:
"analyzer": {
"default": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"my_custom_stop_words_filter"
]
}
}
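Since analysis settings can't be changed on an open index, the complete request means recreating the index (a sketch, reusing the names from your question):
DELETE /my-index-000001

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_custom_stop_words_filter"
          ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}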
OR
Add "stopwords": "_english_" to your filter:
"my_custom_stop_words_filter": {
"type": "stop",
"stopwords": "_english_",
"ignore_case": true
}
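(If you'd rather not recreate and reindex, analysis settings can also be updated in place by closing the index first; a sketch:)
POST /my-index-000001/_close

PUT /my-index-000001/_settings
{
  "analysis": {
    "filter": {
      "my_custom_stop_words_filter": {
        "type": "stop",
        "stopwords": "_english_",
        "ignore_case": true
      }
    }
  }
}

POST /my-index-000001/_open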
Test:
GET my-index-000001/_analyze
{
  "field": "ASCII_FIELD",
  "text": "this that a b The is IS was açaí à la carte"
}
Tokens:
{
  "tokens": [
    {
      "token": "b",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "açaí",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 8
    },
    {
      "token": "à",
      "start_offset": 33,
      "end_offset": 34,
      "type": "word",
      "position": 9
    },
    {
      "token": "la",
      "start_offset": 35,
      "end_offset": 37,
      "type": "word",
      "position": 10
    },
    {
      "token": "carte",
      "start_offset": 38,
      "end_offset": 43,
      "type": "word",
      "position": 11
    }
  ]
}
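With either option applied (and the document reindexed), the original search should come back empty, because "The" is now stripped from the query text as well:
GET my-index-000001/_search
{
  "query": {
    "match": {
      "ASCII_FIELD": "The"
    }
  }
}
A match query whose tokens are all removed by analysis matches no documents, so this returns zero hits.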
Upvotes: 1