Getting ElasticSearch to score number of total nested hits across results (idf?) higher than tf of single hit?

Question

Forgive me if I'm munging the terminology, but I am having problems getting ES to score results in a way that makes sense for my app.

I am indexing thousands of Users with several simple fields, as well as potentially hundreds of child objects nested in the index for each user (i.e. the Book --> Pages data model). The JSON getting sent to the index looks like this:

user_id: 1
  full_name: First User
  username: firstymcfirsterton
  posts: 
   id: 2
    title: Puppies are Awesome
    tags:
     - dog house
     - dog supplies
     - dogs
     - doggies
     - hot dogs
     - dog lovers

user_id: 2
  full_name: Second User
  username: seconddude
  posts: 
   id: 3
    title: Dogs are the best
    tags:
     - dog supperiority
     - dog
   id: 4
    title: Why dogs eat?
    tags: 
     - dog diet
     - canines
   id: 5
    title: Who let the dogs out?
    tags:
     - dogs
     - terrible music

The tags are type "tags", using the "keyword" analyzer, and boosted 10. Titles are not boosted.

When I do a search for "dog", the first user has a higher score than the second user. I assume this has to do with the tf-idf of the first user being higher. However, in my app, the user with more posts that have a hit for the term should ideally come first.

I tried sorting by the number of posts, but this gives junk results if the user has a lot of posts. Basically I want to sort by the number of unique post hits, so that a user who has more posts with hits rises to the top.

How would I go about doing this? Any ideas?

imotov · Accepted Answer

First of all, I agree with @karmi and @Zach that it's important to figure out what you mean by matching posts. For simplicity's sake, I will assume that a matching post has the word "dog" somewhere in it, and that we are not using the keyword analyzer, to make matching on tags and boosting more interesting.

If I understood your question correctly, you want to order users based on the number of relevant posts. That means you need to search posts in order to find the relevant ones and then use this information in your user query. This is possible only if posts are indexed separately, which means posts have to be either child documents or nested fields.

Assuming that posts are child documents, we could prototype your data like this:

curl -XPOST 'http://localhost:9200/test-idx' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
      "user" : {
        "_source" : { "enabled" : true },
        "properties" : {
            "full_name": { "type": "string" },
            "username": { "type": "string" }
        }
      },
      "post" : {
        "_parent" : {
          "type" : "user"
        },
        "properties" : {
            "title": { "type": "string"},
            "tags": { "type": "string", "boost": 10}
        }
      }
    }
}' && echo

curl -XPUT 'http://localhost:9200/test-idx/user/1' -d '{
    "full_name": "First User",
    "username": "firstymcfirsterton"
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/user/2' -d '{
    "full_name": "Second User",
    "username": "seconddude"
}'  && echo

#Posts of the first user
curl -XPUT 'http://localhost:9200/test-idx/post/1?parent=1' -d '{
    "title": "Puppies are Awesome",
    "tags": ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/2?parent=1' -d '{
    "title": "Cats are Awesome too",
    "tags": ["cat", "cat supplies", "cats"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/3?parent=1' -d '{
    "title": "One fine day with a woof and a purr",
    "tags": ["catdog", "cartoons"]
}'  && echo

#Posts of the second user
curl -XPUT 'http://localhost:9200/test-idx/post/4?parent=2' -d '{
    "title": "Dogs are the best",
    "tags": ["dog supperiority", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/5?parent=2' -d '{
    "title": "Why dogs eat?",
    "tags": ["dog diet", "canines"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/6?parent=2' -d '{
    "title": "Who let the dogs out?",
    "tags": ["dogs", "terrible music"]
}'  && echo

curl -XPOST 'http://localhost:9200/test-idx/_refresh' && echo

We can query this data using the Top Children Query. (In the case of nested fields, we could achieve similar results using the Nested Query.)

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "bool" : {
                "should": [
                    { "text" : { "title" : "dog" } },
                    { "text" : { "tags" : "dog" } }
                ]
            }
        },
        "score" : "sum"
    }
  }
}' && echo

This query will return the first user first because of the enormous boost factor that comes from the matched tags. That might not be what you want, but there are a couple of simple ways to fix it. First, we can reduce the boost factor for the tags field: 10 is a really large factor, especially for a field that can be repeated several times. Alternatively, we can modify the query to disregard the scores of child hits completely and use the number of top matched child documents as the score instead:

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "constant_score" : {
                "query" : {
                    "bool" : {
                        "should": [
                            { "text" : { "title" : "dog" } },
                            { "text" : { "tags" : "dog" } }
                        ]
                    }
                }
            }
        },
        "score" : "sum"
    }
  }
}' && echo
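
To see why this ordering changes, here is a minimal Python sketch of the idea behind the last query: every matching post contributes the same constant score, and the scores are summed per user, so the user score is simply the number of matching posts. It does not talk to ElasticSearch. The helper names (`tokens`, `post_matches`, `rank_users`) are hypothetical, and the tokenizer is an assumption that only roughly approximates a standard analyzer (lowercasing, splitting on non-alphanumerics, no stemming). The real top_children query also only looks at the top child hits, so treat this as an idealized model rather than the exact behavior.

```python
import re
from collections import Counter

# Sample posts copied from the prototype data above:
# (post_id, user_id, title, tags)
POSTS = [
    (1, 1, "Puppies are Awesome",
     ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]),
    (2, 1, "Cats are Awesome too", ["cat", "cat supplies", "cats"]),
    (3, 1, "One fine day with a woof and a purr", ["catdog", "cartoons"]),
    (4, 2, "Dogs are the best", ["dog supperiority", "dog"]),
    (5, 2, "Why dogs eat?", ["dog diet", "canines"]),
    (6, 2, "Who let the dogs out?", ["dogs", "terrible music"]),
]


def tokens(text):
    # Assumption: lowercase and split on non-alphanumerics, no stemming,
    # so "dogs" and "catdog" are different tokens from "dog".
    return re.findall(r"[a-z0-9]+", text.lower())


def post_matches(term, title, tags):
    # Mirrors the bool/should clause: a hit in the title OR in any tag.
    if term in tokens(title):
        return True
    return any(term in tokens(tag) for tag in tags)


def rank_users(term, posts):
    # Each matching post adds a constant 1 to its parent user, which is
    # what constant_score combined with "score": "sum" amounts to.
    counts = Counter()
    for _post_id, user_id, title, tags in posts:
        if post_matches(term, title, tags):
            counts[user_id] += 1
    # Highest count first; ties broken by user id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_users("dog", POSTS))
```

Under these assumptions the second user has two matching posts (4 and 5) and the first user has one (post 1), so the second user ranks first no matter how many "dog" tags the first user's single post carries.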
