I am currently using this Elasticsearch DSL query:
{
"_source": [
"title",
"bench",
"id_",
"court",
"date"
],
"size": 15,
"from": 0,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title",
"content"
]
}
},
"filter": [],
"should": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title.standard^16",
"content.standard"
]
}
}
}
},
"highlight": {
"pre_tags": [
"<tag1>"
],
"post_tags": [
"</tag1>"
],
"fields": {
"content": {}
}
}
}
Here's what's happening: if I search for I.r coelho, it returns the correct results. But if I search for I R coelho (without the period), it returns a different result. How do I prevent this from happening? I want the search to behave the same even if there are extra periods, spaces, commas, etc.
Mapping
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Settings:
{
"courts_2": {
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "19000000"
},
"number_of_shards": "5",
"provided_name": "courts_2",
"creation_date": "1581094116992",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "true",
"type": "phonetic",
"encoder": "metaphone"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"my_metaphone"
],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "MZSecLIVQy6jiI6YmqOGLg",
"version": {
"created": "7010199"
}
}
}
}
}
EDIT
Here are the results for I.R coelho from my_analyzer:
{
"tokens": [
{
"token": "IR",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "KLH",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Standard analyzer:
{
"tokens": [
{
"token": "i.r",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "coelho",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Answer:
The reason why you get different behaviour when searching for I.r coelho and I R coelho is that you are using different analyzers on the same fields: my_analyzer for title and content (the must block), and standard (the default) for title.standard and content.standard (the should block).
The two analyzers generate different tokens, which leads to a different score when you search for I.r coelho (e.g., 2 tokens with the standard analyzer) versus I R coelho (e.g., 3 tokens with the standard analyzer). You can test the behaviour of your analyzers with the _analyze API (see the Elastic documentation).
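For example, you can compare the two analyzers directly against your courts_2 index; the text below is the query from your problem report:
GET courts_2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I R coelho"
}
GET courts_2/_analyze
{
  "analyzer": "standard",
  "text": "I R coelho"
}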
You have to decide whether this is your desired behaviour.
Update (after requested clarifications from the OP)
The results of the _analyze query confirmed the hypothesis: the two analyzers make different score contributions and, consequently, return different results depending on whether your query includes symbol characters or not.
If you don't want the results of your query to be affected by symbols such as dots, or by upper/lower case, you will need to reconsider which analyzers you apply: the ones currently in use will never satisfy your requirements. If I understood your requirements correctly, the simple built-in analyzer should be the right one for your use case.
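For instance, the simple analyzer splits on every non-letter character and lowercases, so both variants of your query text produce the same tokens:
POST _analyze
{
  "analyzer": "simple",
  "text": "I.R coelho"
}
POST _analyze
{
  "analyzer": "simple",
  "text": "I R coelho"
}
Both requests should return the same three tokens: i, r, and coelho.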
In a nutshell: (1) you should consider replacing the standard built-in analyzer with the simple one, and (2) you should decide whether you want your query to apply different scores to the hits based on different analyzers (i.e., the custom phonetic one on the title and content fields, and the simple one on their respective standard subfields).
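As a sketch of point (1), and assuming you are willing to reindex (the analyzer of an existing field cannot be changed in place), the title field could be mapped like this, and content analogously:
"title": {
  "type": "text",
  "analyzer": "my_analyzer",
  "fields": {
    "standard": {
      "type": "text",
      "analyzer": "simple"
    }
  }
}
Your query can then stay as it is, since it already targets title.standard and content.standard.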