Reputation: 1937
Forgive me if I'm munging the terminology, but I am having problems getting ES to score results in a way that makes sense for my app.
I am indexing thousands of Users with several simple fields, as well as potentially hundreds of child objects nested in the index for each user (similar to a Book --> Pages data model). The JSON getting sent to the index looks like this:
- user_id: 1
  full_name: First User
  username: firstymcfirsterton
  posts:
    - id: 2
      title: Puppies are Awesome
      tags:
        - dog house
        - dog supplies
        - dogs
        - doggies
        - hot dogs
        - dog lovers
- user_id: 2
  full_name: Second User
  username: seconddude
  posts:
    - id: 3
      title: Dogs are the best
      tags:
        - dog supperiority
        - dog
    - id: 4
      title: Why dogs eat?
      tags:
        - dog diet
        - canines
    - id: 5
      title: Who let the dogs out?
      tags:
        - dogs
        - terrible music
The tags are of type "tags", use the "keyword" analyzer, and are boosted by 10. Titles are not boosted.
When I do a search for "dog", the first user has a higher score than the second user. I assume this has to do with the tf-idf of the first user being higher. However, in my app, the more of a user's posts that have a hit for the term, the higher that user should ideally rank.
I tried sorting by the number of posts, but this gives junk results if the user has a lot of posts. Basically, I want to sort by the number of unique post hits, so that a user who has more posts with hits rises to the top.
How would I go about doing this? Any ideas?
Upvotes: 3
Views: 1162
Reputation: 30163
First of all, I agree with @karmi and @Zach that it's important to figure out what you mean by matching posts. For simplicity's sake, I will assume that a matching post is one that contains the word "dog" somewhere in it, and that we are not using the keyword analyzer, which makes matching on tags and boosting more interesting.
If I understood your question correctly, you want to order users based on the number of relevant posts. That means you need to search the posts to find the relevant ones and then use this information in your user query. This is possible only if posts are indexed separately, which means posts have to be either child documents or nested fields.
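To make the goal concrete before involving Elasticsearch at all, here is a minimal Python sketch of the two-step idea: find the matching posts, then order users by how many of their posts matched. It is only an illustration under stated assumptions. The `tokens`, `post_matches`, and `rank_users` helpers are hypothetical names, and the matching rule (lowercase, split on non-alphanumeric characters, exact token match with no stemming) is a rough stand-in for a standard-style analyzer, not what Elasticsearch actually runs. The sample posts mirror the prototype data used in this answer.

```python
import re
from collections import Counter

# Toy copy of the prototype data: (post_id, parent_user_id, title, tags).
POSTS = [
    (1, 1, "Puppies are Awesome",
     ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]),
    (2, 1, "Cats are Awesome too", ["cat", "cat supplies", "cats"]),
    (3, 1, "One fine day with a woof and a purr", ["catdog", "cartoons"]),
    (4, 2, "Dogs are the best", ["dog supperiority", "dog"]),
    (5, 2, "Why dogs eat?", ["dog diet", "canines"]),
    (6, 2, "Who let the dogs out?", ["dogs", "terrible music"]),
]


def tokens(text):
    """Assumed analysis: lowercase and split on non-alphanumerics, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def post_matches(term, title, tags):
    """A post matches if the exact term occurs as a token in its title or tags."""
    words = set(tokens(title))
    for tag in tags:
        words.update(tokens(tag))
    return term in words


def rank_users(term, posts):
    """Return (user_id, matching_post_count) pairs, highest count first."""
    counts = Counter(
        user_id
        for _post_id, user_id, title, tags in posts
        if post_matches(term, title, tags)
    )
    return counts.most_common()


print(rank_users("dog", POSTS))
```

Under this toy rule "dogs" does not match "dog", so only posts 1, 4, and 5 count as hits, and the second user (two matching posts) ranks above the first user (one matching post). Each post contributes at most once, no matter how many of its tags match.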
Assuming that posts are child documents, we could prototype your data like this:
curl -XPOST 'http://localhost:9200/test-idx' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "user": {
      "_source": { "enabled": true },
      "properties": {
        "full_name": { "type": "string" },
        "username": { "type": "string" }
      }
    },
    "post": {
      "_parent": {
        "type": "user"
      },
      "properties": {
        "title": { "type": "string" },
        "tags": { "type": "string", "boost": 10 }
      }
    }
  }
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/user/1' -d '{
  "full_name": "First User",
  "username": "firstymcfirsterton"
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/user/2' -d '{
  "full_name": "Second User",
  "username": "seconddude"
}' && echo
# Posts of the first user
curl -XPUT 'http://localhost:9200/test-idx/post/1?parent=1' -d '{
  "title": "Puppies are Awesome",
  "tags": ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/post/2?parent=1' -d '{
  "title": "Cats are Awesome too",
  "tags": ["cat", "cat supplies", "cats"]
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/post/3?parent=1' -d '{
  "title": "One fine day with a woof and a purr",
  "tags": ["catdog", "cartoons"]
}' && echo
# Posts of the second user
curl -XPUT 'http://localhost:9200/test-idx/post/4?parent=2' -d '{
  "title": "Dogs are the best",
  "tags": ["dog supperiority", "dog"]
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/post/5?parent=2' -d '{
  "title": "Why dogs eat?",
  "tags": ["dog diet", "canines"]
}' && echo
curl -XPUT 'http://localhost:9200/test-idx/post/6?parent=2' -d '{
  "title": "Who let the dogs out?",
  "tags": ["dogs", "terrible music"]
}' && echo
curl -XPOST 'http://localhost:9200/test-idx/_refresh' && echo
We can query this data using the Top Children Query. (In the case of nested fields, we could achieve similar results using the Nested Query.)
curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children": {
      "type": "post",
      "query": {
        "bool": {
          "should": [
            { "text": { "title": "dog" } },
            { "text": { "tags": "dog" } }
          ]
        }
      },
      "score": "sum"
    }
  }
}' && echo
This query will return the first user first because of the enormous boost factor that comes from the matched tags. That may not be what you want, but there are a couple of simple ways to fix it. First, we can reduce the boost factor for the tags field: 10 is a really large factor, especially for a field that can be repeated several times. Alternatively, we can modify the query to disregard the scores of child hits completely and use the number of top matched child documents as the score instead:
curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children": {
      "type": "post",
      "query": {
        "constant_score": {
          "query": {
            "bool": {
              "should": [
                { "text": { "title": "dog" } },
                { "text": { "tags": "dog" } }
              ]
            }
          }
        }
      },
      "score": "sum"
    }
  }
}' && echo
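To see why wrapping the child query in constant_score changes the ordering, here is a small Python sketch of the "sum" aggregation over child hits. It is a toy model, not Elasticsearch code: the `parent_scores` helper is a hypothetical name, and the child scores in `HITS` are invented numbers chosen only to mimic one heavily boosted post versus two modestly scored ones. They are not real Lucene scores.

```python
def parent_scores(child_hits, constant=False):
    """Sum child scores per parent, as a toy model of "score": "sum".

    child_hits is a list of (parent_id, child_score) pairs.
    With constant=True every matched child contributes exactly 1.0,
    so the parent's score becomes the number of matched children.
    """
    totals = {}
    for parent_id, child_score in child_hits:
        contribution = 1.0 if constant else child_score
        totals[parent_id] = totals.get(parent_id, 0.0) + contribution
    # Highest total first, like a relevance-sorted result list.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented scores: user 1 has one post with many boosted tag hits,
# user 2 has two posts with smaller individual scores.
HITS = [(1, 25.0), (2, 10.0), (2, 8.0)]

print(parent_scores(HITS))                 # boosted scores: user 1 first
print(parent_scores(HITS, constant=True))  # constant scores: user 2 first
```

With the boosted scores a single strongly matching post can outweigh several weaker ones. With constant scores each matching child counts once, so the parent with more matching posts comes out on top.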
Upvotes: 2