Reputation: 1454
How can I query and sort text by the criteria below in Elasticsearch?
1 - the search query matches exactly at the start of the result
2 - the search query matches exactly elsewhere in the result
3 - the result contains all words of the search query
For example:
When I search: i love dogs
The results must be in this order:
1- I love dogs
2 - i love dogs and birds
3 - birds good but i love dogs and horses
4 - Horses and i love dogs
5 - I love horses and dogs
6 - good dogs and i love horses
Upvotes: 3
Views: 1188
Reputation: 6066
It is possible to achieve the desired behavior, but it will require quite some tweaking of your mapping and the query.
First, here's the mapping:
PUT my_phrase_search
{
  "mappings": {
    "doc": {
      "properties": {
        "expected_position": {
          "type": "long"
        },
        "my_phrase": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256,
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom"
          }
        }
      }
    }
  }
}
Note: I added the field expected_position to make evaluation of the results easier.
Now, the query:
POST my_phrase_search/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix",
            "boost": 2
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "match"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "phrase",
            "boost": 2
          }
        }
      ]
    }
  }
}
This gives the following results:
[
  {
    "_score": 4.015718,
    "_source": {
      "my_phrase": "I love dogs",
      "expected_position": 1
    },
    "matched_queries": [
      "match",
      "phrase",
      "prefix"
    ]
  },
  {
    "_score": 3.233316,
    "_source": {
      "my_phrase": "i love dogs and birds",
      "expected_position": 2
    },
    "matched_queries": [
      "match",
      "phrase",
      "prefix"
    ]
  },
  {
    "_score": 1.3836111,
    "_source": {
      "my_phrase": "birds good but i love dogs and horses ",
      "expected_position": 3
    },
    "matched_queries": [
      "match",
      "phrase"
    ]
  },
  {
    "_score": 1.2333161,
    "_source": {
      "my_phrase": "Horses and i love dogs",
      "expected_position": 4
    },
    "matched_queries": [
      "match",
      "phrase"
    ]
  },
  {
    "_score": 0.8630463,
    "_source": {
      "my_phrase": "I love horses and dogs",
      "expected_position": 5
    },
    "matched_queries": [
      "match"
    ]
  },
  {
    "_score": 0.38110584,
    "_source": {
      "my_phrase": "good dogs and i love horses",
      "expected_position": 6
    },
    "matched_queries": [
      "match"
    ]
  }
]
You may wonder, how does it work? Are all these changes necessary? Let's find out.
Just a text field and a match query?
The match query would look like this:
POST my_phrase/doc/_search
{
  "query": {
    "match": {
      "my_phrase": "i love dogs"
    }
  }
}
This will give us the following order of the results: 5 - 1 - 3 - 2 - 4 - 6.
The question is, why did the query for "i love dogs" not return the perfect match, 1 - I love dogs, as the first result? Why did 5 - I love horses and dogs come first?
In this case the answer is avgFieldLength, which is used in the computation of the score. It is computed per shard, so it can differ slightly for documents that live on different shards.
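To see how the average field length can move a score, here is a minimal sketch of the textbook BM25 term formula with the commonly cited defaults k1 = 1.2 and b = 0.75. The function name, the fixed idf of 1.0, and the two shard averages are assumptions made for illustration; Elasticsearch's actual numbers also depend on per-shard idf and the Lucene version.

```python
def bm25_term_score(tf, doc_len, avg_field_len, idf=1.0, k1=1.2, b=0.75):
    """Score contribution of one term under the textbook BM25 formula."""
    # Length normalization: documents shorter than the average get a boost.
    norm = 1 - b + b * doc_len / avg_field_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# The same 3-token document scored against two hypothetical shard averages.
on_shard_a = bm25_term_score(tf=1, doc_len=3, avg_field_len=3.0)
on_shard_b = bm25_term_score(tf=1, doc_len=3, avg_field_len=6.0)

print(on_shard_a, on_shard_b)
```

The document itself is identical in both calls; only the shard-level average changes, and the score changes with it. With only six short documents, such differences are large enough to reorder results.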
It is pretty obvious that ES should give us results that start with our query. How can we tell ES to prefer such documents?
prefix search on keyword field
We can combine a prefix query with a match query via a bool query (which can be roughly interpreted as an OR in this case), like this:
POST my_phrase/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "my_phrase.keyword": "i love dogs"
          }
        },
        {
          "match": {
            "my_phrase": "i love dogs"
          }
        }
      ]
    }
  }
}
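Note that the prefix query is run against the keyword sub-field, since it needs to treat the whole field value as one big token. On an analyzed text field it would be compared against the individual terms instead.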
This query gives us the following order of the results: 2 - 5 - 1 - 3 - 4 - 6.
2 jumped up, but 1 did not. Why did it happen?
Here the case of the characters comes into play: the keyword data type is not analyzed, so i and I are different characters for this prefix search.
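The effect can be mimicked in a few lines. This is only a sketch of the idea, not of how Elasticsearch evaluates prefix queries internally: the function name and the lowercase_normalizer flag are made up for illustration, and the query text is assumed to be lowercase already.

```python
def keyword_prefix_match(stored_value, prefix, lowercase_normalizer=False):
    """Mimic a prefix query against a keyword value stored as one token."""
    if lowercase_normalizer:
        # A lowercase normalizer changes what gets stored in the index.
        stored_value = stored_value.lower()
    return stored_value.startswith(prefix)


docs = ["I love dogs", "i love dogs and birds", "Horses and i love dogs"]
for doc in docs:
    plain = keyword_prefix_match(doc, "i love dogs")
    normalized = keyword_prefix_match(doc, "i love dogs", lowercase_normalizer=True)
    print(f"{doc!r}: plain={plain}, normalized={normalized}")
```

Without normalization, "I love dogs" is not a prefix match because of the capital I, which is exactly why document 1 did not move up.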
How can we make keyword case-insensitive?
Making keyword case-insensitive
This is achieved by defining a normalizer in the mapping:
PUT my_phrase2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_phrase": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256,
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}
The same query will now give us the following order: 1 - 2 - 5 - 3 - 4 - 6.
This is already pretty good, but 5 - I love horses and dogs is still too high: higher than 3 - birds good but i love dogs and horses, which has an exact phrase match.
The match query does not care about the order of words in the phrase. Can we boost the documents that have the correct order?
match_phrase to boost phrase matching
There is a match_phrase query that does favor tokens in the original order. Let's use it in the query:
POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "my_phrase.keyword": "i love dogs"
          }
        },
        {
          "match_phrase": {
            "my_phrase": "i love dogs"
          }
        },
        {
          "match": {
            "my_phrase": "i love dogs"
          }
        }
      ]
    }
  }
}
This gives us the following order: 1 - 2 - 3 - 5 - 4 - 6.
3 popped up! But 5 - I love horses and dogs is still higher than 4 - Horses and i love dogs. It looks like the phrase match should have favored 4.
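As a sanity check on which of the six documents should count as phrase matches, here is a rough sketch that contrasts "contains all the words" with "contains the words consecutively and in order". Lowercasing and splitting on whitespace is only a stand-in for the standard analyzer, and the function names are hypothetical. Also note that a match query by default requires only one of the terms; the sketch checks for all of them, as in the question's third criterion.

```python
def tokens(text):
    """Lowercase and split on whitespace, a rough stand-in for an analyzer."""
    return text.lower().split()


def contains_all_terms(doc, query):
    """True if every query word occurs somewhere in the document."""
    doc_terms = set(tokens(doc))
    return all(term in doc_terms for term in tokens(query))


def contains_phrase(doc, query):
    """True if the query words occur consecutively and in order."""
    d, q = tokens(doc), tokens(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = {
    1: "I love dogs",
    2: "i love dogs and birds",
    3: "birds good but i love dogs and horses ",
    4: "Horses and i love dogs",
    5: "I love horses and dogs",
    6: "good dogs and i love horses",
}

all_hits = [n for n, text in docs.items() if contains_all_terms(text, "i love dogs")]
phrase_hits = [n for n, text in docs.items() if contains_phrase(text, "i love dogs")]
print("all terms:", all_hits)
print("phrase:   ", phrase_hits)
```

Documents 5 and 6 contain every word but not the phrase, so a phrase-aware clause is the part that should separate 4 from 5.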
The query has become quite complex, so let's find out which parts of it the documents actually matched.
It is possible to give names to queries to see which parts of a complex query actually took effect:
POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix"
          }
        },
        ...
The response for the documents of interest will give us:
{
  "_score": 0.8630463,
  "_source": {
    "my_phrase": "I love horses and dogs",
    "expected_position": 5
  },
  "matched_queries": [
    "match"
  ]
},
{
  "_score": 0.82221067,
  "_source": {
    "my_phrase": "Horses and i love dogs",
    "expected_position": 4
  },
  "matched_queries": [
    "match",
    "phrase"
  ]
},
Doc 5 did not match the phrase part. It looks like score fluctuations hit us again.
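When there are more than a handful of hits, reading matched_queries by eye gets tedious. A small helper along these lines can condense the hits into one line each; summarize_hits is a hypothetical name, and the sample data is copied from the two hits above. With a real response, the list would typically come from the hits.hits array of the search response.

```python
def summarize_hits(hits):
    """Return (expected_position, score, matched query names) in rank order."""
    return [
        (
            hit["_source"]["expected_position"],
            hit["_score"],
            sorted(hit.get("matched_queries", [])),
        )
        for hit in hits
    ]


# The two hits shown above, in the order Elasticsearch returned them.
hits = [
    {
        "_score": 0.8630463,
        "_source": {"my_phrase": "I love horses and dogs", "expected_position": 5},
        "matched_queries": ["match"],
    },
    {
        "_score": 0.82221067,
        "_source": {"my_phrase": "Horses and i love dogs", "expected_position": 4},
        "matched_queries": ["match", "phrase"],
    },
]

for position, score, names in summarize_hits(hits):
    print(position, score, names)
```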
The phrase query looks more relevant; is there a way to boost it?
There is a way to affect the computation of the score by telling ES that certain parts of the query are more important. It is called boost. Here's what it might look like:
POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix",
            "boost": 2
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "match"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "phrase",
            "boost": 2
          }
        }
      ]
    }
  }
}
This one gives us the desired order of results: 1 - 2 - 3 - 4 - 5 - 6.
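The scores reported in this answer are enough to check why the boost flips documents 4 and 5. The sketch below assumes that a bool/should query sums the scores of its matching clauses and that boost multiplies a clause's score; under that assumption, the match and phrase contributions for document 4 can be backed out of its totals before and after boosting.

```python
# Totals reported in this answer for document 4 ("match" and "phrase" matched).
doc4_before = 0.82221067  # assumed to be match + phrase
doc4_after = 1.2333161    # assumed to be match + 2 * phrase
doc5_score = 0.8630463    # "match" only, so the phrase boost does not touch it

# Solve the two equations for the individual contributions.
phrase_part = doc4_after - doc4_before
match_part = doc4_before - phrase_part

print("phrase contribution:", round(phrase_part, 4))
print("match contribution: ", round(match_part, 4))
print("doc 4 below doc 5 before boost:", doc4_before < doc5_score)
print("doc 4 above doc 5 after boost: ", doc4_after > doc5_score)
```

The two contributions come out nearly equal, so doubling the phrase part is just enough to lift document 4 over document 5. The margin is small, which is another reason to validate boosts on real data.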
Note that we also boosted the prefix query because we wanted to lower the relative importance of match.
Although this query does the job, you might want to perform a great deal of real-world validation and further tweaking to ensure adequate search results.
A query that fits those 6 documents perfectly might not fit a large real-world collection; please take this answer as a starting point for your optimization.
As you can see, not all the parts of the query are necessary: names of queries can be easily omitted, but serve as good aid in understanding how a document was matched.
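If the search text comes from user input, the request body can be generated instead of hand-edited. The following is a sketch of one way to do that with plain dictionaries; build_ranked_phrase_query and its parameters are hypothetical names, and the default boost of 2 is simply the value used above, not a recommendation.

```python
import json


def build_ranked_phrase_query(text, field="my_phrase", boost=2, named=True):
    """Build the bool/should query from this answer for any search text."""

    def clause(name, inner, clause_boost=None):
        # Each sub-query is wrapped in its own bool, as in the query above,
        # so that _name and boost can be attached to the wrapper.
        wrapper = {"should": [inner]}
        if named:
            wrapper["_name"] = name
        if clause_boost is not None:
            wrapper["boost"] = clause_boost
        return {"bool": wrapper}

    return {
        "query": {
            "bool": {
                "should": [
                    clause("prefix", {"prefix": {f"{field}.keyword": text}}, boost),
                    clause("match", {"match": {field: text}}),
                    clause("phrase", {"match_phrase": {field: text}}, boost),
                ]
            }
        }
    }


body = build_ranked_phrase_query("i love dogs")
print(json.dumps(body, indent=2))
```

Passing named=False drops the _name entries, which matches the point above that the names are only a debugging aid.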
Upvotes: 3
Reputation: 517
To get your desired results, use match_phrase_prefix, optionally with parameters like max_expansions. See the example below.
GET /_search
{
  "query": {
    "match_phrase_prefix": {
      "message": "quick brown f"
    }
  }
}
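The example above does not show max_expansions itself. In the expanded form of match_phrase_prefix, the field maps to an object with query and max_expansions keys. A minimal sketch that builds such a body follows; the function name is hypothetical, and the field and text are borrowed from the question rather than from this answer.

```python
import json


def phrase_prefix_query(field, text, max_expansions=50):
    """Build a match_phrase_prefix body using the expanded per-field form."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    # Upper bound on how many terms the last word may expand to.
                    "max_expansions": max_expansions,
                }
            }
        }
    }


body = phrase_prefix_query("my_phrase", "i love dogs", max_expansions=10)
print(json.dumps(body, indent=2))
```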
Upvotes: 0