I created a very simple test index consisting of the following 5 entries:
{ "tags": [ { "topics": "music festival dance techno germany"} ]}
{ "tags": [ { "topics": "music festival dance techno"} ]}
{ "tags": [ { "topics": "music festival dance"} ]}
{ "tags": [ { "topics": "music festival"} ]}
{ "tags": [ { "topics": "music"} ]}
Then I performed the following query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "tags.topics": "music festival" } }
      ]
    }
  }
}
I expected to obtain the following order in the results:
1) "music festival"
2) "music festival dance"
3) "music festival dance techno"
4) "music festival dance techno germany"
5) "music"
This is the order I would expect once field-length normalization is taken into account.
However, I got the following:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance techno germany"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "4",
        "_score": 0.42221835,
        "_source": {
          "tags": [
            {
              "topics": "music festival"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "2",
        "_score": 0.32088596,
        "_source": {
          "tags": [
            {
              "topics": "music festival dance techno"
            }
          ]
        }
      },
      {
        "_index": "testindex",
        "_type": "entry",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "tags": [
            {
              "topics": "music"
            }
          ]
        }
      }
    ]
  }
}
The order seems completely random, except for the lowest-scoring entry, which matched only one word.
What could be causing this, and what could I change (during mapping, indexing or searching) to get the expected order?
Note: The same goes for queries that do not match perfectly. Searching "music dance" should still return the 3-word entry as the first result, so using or boosting term queries seems out of the question.
Upvotes: 3
Views: 1841
Reputation: 4535
As I described in this answer, scoring/relevance is not the easiest topic in Elasticsearch.
I tried to work out a solution for you, and this is what I have so far.
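To illustrate how per-shard statistics can produce scores like the ones in the question, here is a rough Python sketch of BM25 scoring. It assumes the default BM25 parameters (k1 = 1.2, b = 0.75) and the BM25 formula used by ES 6.x, and it assumes a shard layout that is only a guess inferred from the reported scores: entries 1, 3 and 5 each alone on a shard, entries 2 and 4 sharing one. The helper names `bm25_term` and `score` are made up for this example.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults (assumed unchanged in the index settings)


def bm25_term(doc_len, avg_len, doc_count, doc_freq, tf=1):
    """Contribution of one query term to one document's score."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_len))
    return idf * norm


def score(query_terms, doc, shard_docs):
    """Sum BM25 over the query terms using the statistics of ONE shard only."""
    lengths = [len(d.split()) for d in shard_docs]
    avg_len = sum(lengths) / len(lengths)
    words = doc.split()
    total = 0.0
    for term in query_terms:
        tf = words.count(term)
        if tf:
            doc_freq = sum(term in d.split() for d in shard_docs)
            total += bm25_term(len(words), avg_len, len(shard_docs), doc_freq, tf)
    return total


query = ["music", "festival"]

# Hypothetical shard layout (a guess, not taken from the cluster itself).
shards = [
    ["music festival dance techno germany"],
    ["music festival dance"],
    ["music festival", "music festival dance techno"],
    ["music"],
]
per_shard = {doc: score(query, doc, shard) for shard in shards for doc in shard}

# The same five entries scored with shared statistics (a single shard).
all_docs = [doc for shard in shards for doc in shard]
single_shard = {doc: score(query, doc, all_docs) for doc in all_docs}

for doc in sorted(single_shard, key=single_shard.get, reverse=True):
    print(f"{doc!r}: per-shard={per_shard[doc]:.7f} single-shard={single_shard[doc]:.7f}")
```

Under these assumptions the per-shard values agree with the `_score` values in the question, and scoring all five entries against shared statistics ranks them in the order the question expected. If that is what is happening, a test index with a single primary shard, or searching with `search_type=dfs_query_then_fetch`, may be worth trying.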
Documents:
{ "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"]} ], "topics_count": 5 }
{ "tags": [ { "topics": ["music", "festival", "dance", "techno"]} ], "topics_count": 4 }
{ "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
{ "tags": [ { "topics": ["music", "festival"]} ], "topics_count": 2 }
{ "tags": [ { "topics": ["music"]} ], "topics_count": 1 }
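The reshaped documents above can be derived from the original ones before indexing. A minimal Python sketch, assuming topics are separated by whitespace and that `topics_count` should count the topics across all `tags` entries; the function name `to_indexed_doc` is hypothetical:

```python
def to_indexed_doc(doc):
    """Split each space-separated topics string into a list and count the topics."""
    tags = [{"topics": tag["topics"].split()} for tag in doc["tags"]]
    # Assumption: the count covers every tag entry, not only the first one.
    topics_count = sum(len(tag["topics"]) for tag in tags)
    return {"tags": tags, "topics_count": topics_count}


original = {"tags": [{"topics": "music festival dance"}]}
print(to_indexed_doc(original))
```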
And the query:
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "params.num_terms"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        },
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics": {
                  "terms": ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "doc['topics_count'].value"
                  }
                }
              }
            },
            "script_score": {
              "script": {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        }
      ]
    }
  }
}
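To make the intent of the two `should` clauses easier to follow, here is a small Python sketch of which clause a document would satisfy and the length factor the script applies. It does not compute real Elasticsearch scores; it only mirrors the matching rules, assuming each document's topics are distinct and `topics_count` equals their number. The function names are made up for the example.

```python
import math


def matching_clauses(query_terms, topics):
    """Return which of the two terms_set clauses a document would satisfy."""
    matched = len(set(query_terms) & set(topics))
    clauses = []
    # First clause: minimum_should_match = params.num_terms (all query terms present).
    if matched > 0 and matched >= len(query_terms):
        clauses.append("all query terms present")
    # Second clause: minimum_should_match = topics_count (every topic was queried).
    if matched > 0 and matched >= len(topics):
        clauses.append("all document topics queried")
    return clauses


def length_factor(topics):
    """Multiplier applied by the script: sqrt(1 / topics_count)."""
    return math.sqrt(1.0 / len(topics))


docs = [
    ["music", "festival", "dance", "techno", "germany"],
    ["music", "festival", "dance", "techno"],
    ["music", "festival", "dance"],
    ["music", "festival"],
    ["music"],
]
for topics in docs:
    print(topics, matching_clauses(["music", "festival"], topics), length_factor(topics))
```

The first clause rewards documents that contain every query term, the second rewards documents whose topics are all covered by the query, and the factor favours shorter topic lists in both cases.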
It is not perfect and still needs some improvement. On this example it works well (tested on ES 6.2) for ["music", "festival"] and ["music", "dance"], but I suspect that on other data it will not behave 100% as you expect, mostly because of the complexity of relevance scoring. You can now read more about the features I used and try to improve on it.
Upvotes: 2