Reputation: 13
I'm trying to disable the lowercase filter so that I can do case-sensitive matching on text fields. Following the index and analyzer docs, I created the following index with a custom analyzer that leaves out the lowercase filter:
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"asciifolding"
]
}
}
}
}
}
I enable fielddata on the field so I can inspect the indexed tokens afterward:
PUT my_index/_mapping/_doc
{
"properties": {
"my_field": {
"type": "text",
"fielddata": true
}
}
}
I test the custom analyzer to make sure it doesn't lowercase, as expected
POST /my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "Is this <b>déjà Vu</b>?"
}
which gets the following response
{
"tokens": [
{
"token": "Is",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "this",
"start_offset": 3,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "deja",
"start_offset": 11,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "Vu",
"start_offset": 16,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
}
]
}
Great, nothing is getting lowercased, just like I wanted. So now I index the same text to see what happens:
POST /my_index/_doc
{
"my_field": "Is this <b>déjà Vu</b>?"
}
and try querying back for it
POST /my_index/_search
{
"query": {
"regexp": {
"my_field": "Is.*"
}
},
"docvalue_fields": [
"my_field"
]
}
and get no hits. Now if I try lowercasing the regex, I get
POST /my_index/_search
{
"query": {
"regexp": {
"my_field": "is.*"
}
},
"docvalue_fields": [
"my_field"
]
}
which returns
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "6d6PP20BXDCQSINU0RC_",
"_score": 1,
"_source": {
"my_field": "Is this <b>déjà Vu</b>?"
},
"fields": {
"my_field": [
"b",
"déjà",
"is",
"this",
"vu"
]
}
}
]
}
}
So it seems that things are still getting lowercased somewhere, since only the lowercase regex matches and the docvalue fields all come back lowercased. What am I doing wrong here?
Upvotes: 1
Views: 178
Reputation: 217514
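Good start so far!
The only issue is that you're not applying your custom analyzer to your field. Without an explicit analyzer setting, a text field is indexed with the default standard analyzer, which lowercases every token and does not strip HTML (that is why b shows up as a term). Define the field with the analyzer setting as shown here. Note that the analyzer of an existing field can't be changed in place, so you'll need to recreate the index with this mapping and index the document again.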
PUT my_index/_mapping/_doc
{
"properties": {
"my_field": {
"type": "text",
"fielddata": true,
"analyzer": "my_custom_analyzer"
}
}
}
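To see why the original mapping behaved the way it did, here is a rough Python model of the two analysis chains. It is only a sketch and does not call Elasticsearch: the helper functions are hypothetical stand-ins that approximate the standard tokenizer, html_strip, and asciifolding well enough for this one sentence, and Python's re module stands in for Lucene's regular expression syntax (the two differ in general, but not for these simple patterns).

```python
import re
import unicodedata


def standard_tokenize(text):
    # Rough stand-in for the standard tokenizer: runs of word characters.
    # NFC keeps accented letters as single word characters.
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))


def strip_html(text):
    # Rough stand-in for the html_strip char filter (inline tags only).
    return re.sub(r"<[^>]+>", "", text)


def ascii_fold(token):
    # Rough stand-in for asciifolding: decompose, then drop combining accents.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def default_analyzer(text):
    # Approximates a text field with no analyzer set: tokenize, then lowercase.
    return [token.lower() for token in standard_tokenize(text)]


def my_custom_analyzer(text):
    # html_strip -> standard tokenizer -> asciifolding, with no lowercase step.
    return [ascii_fold(token) for token in standard_tokenize(strip_html(text))]


def regexp_matches(pattern, tokens):
    # A regexp query has to match an entire indexed term, hence fullmatch.
    return any(re.fullmatch(pattern, token) for token in tokens)


text = "Is this <b>déjà Vu</b>?"
indexed_default = default_analyzer(text)
indexed_custom = my_custom_analyzer(text)

print(sorted(set(indexed_default)))             # ['b', 'déjà', 'is', 'this', 'vu']
print(indexed_custom)                           # ['Is', 'this', 'deja', 'Vu']
print(regexp_matches("Is.*", indexed_default))  # False
print(regexp_matches("is.*", indexed_default))  # True
print(regexp_matches("Is.*", indexed_custom))   # True
```

The first printed list matches the terms returned by docvalue_fields in the question, which suggests the field was indexed with the default analyzer. The second matches the _analyze output, which is what the field should contain once the custom analyzer is set in the mapping.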
Upvotes: 1